Some Promising Early Results from a Rudimentary
Latent-Trait Theory of Performance Rating
University of Arkansas for Medical Sciences



A theory in which observed performance ratings are derived
from the distance between a rater reference point and
subject performance point located on a postulated
equal-interval scale and a postulated s-shaped rater
characteristic curve, operationalized as the normal ogive,
is presented. Least-squares estimates of rater (nR=47, 31,
and 29) and subject (nS=29, 30, and 35) points were
determined separately on each of three junior medical
student cohorts' data. The proposed model fit each of the
data sets better than two alternative models (r>0.?0,
p<0.01). Test-retest reliabilities for rater parameters
were r<0.29 (joint p<0.04). Cross-validating results also
supported the theory.
Usually we must rely upon the judgement of human raters
to assess, i.e., to measure and evaluate, complex human
performance and products. In this context, "measure" means
a systematic procedure which assigns numbers (e.g., scores,
ratings), the values of which represent how much of some
attribute, characteristic, or factor is present.
"Evaluation" means the determination of merit or adequacy.
We rely upon human judgement to assess performances as
varied as (a) conducting a cross-examination in a trial
court, (b) diagnosing a patient's medical problem, and (c)
landing a high-performance aircraft. Also, human judgement
is fundamental to the assessment of such products as (a) an
article submitted for publication, (b) the prototype of an
implantable mechanical heart, and (c) the design plans for a
new mousetrap or orbital shuttlecraft.

The research reported here is concerned with improving
ratings-based measures of human performance. Our interest
in the problems associated with ratings arose in the context
of health professions education. Specifically, we were
interested in improving the assessment of student
achievement in real or high-fidelity simulated practice
settings, that is, assessment of their clinical performance.
Clinical performance appears to be almost archetypical of
complex performance in a complex setting. We shall
explicitly address only the restricted domain of health
professionals' clinical performance. Nevertheless, the
discussion has direct implications for other areas which
share the common elements of reliance upon rater judgement
and the assessment of something that is intrinsically
complex.

Because the membership of AERA Division I (Professions
Education) is quite heterogeneous and at the specific
request of two of the reviewers of our paper proposal, we
first provide a fairly discursive, conceptual, intuitive
discussion of factors affecting rating reliability and
validity. The rating process is presented in contrast to
the objective testing process because the fundamentals of
test design and analysis concepts and statistics are fairly
broadly understood in the division. Latent-trait theory is
then introduced in the same way: first, as it applies to
objectively scored tests; then, we present our proposed
latent-trait theory of performance rating and a simplified
model of it. The balance of the paper presents the specific
research objectives, methods, results, discussion and
conclusions from empirical tests of our rudimentary theory.
Briefly, we found what we consider substantial support for
our proposed model where it may be appropriately applied.

Paper presented at the Annual Meeting of the American
Educational Research Association, Los Angeles, 1981.

One can get reliable and valid ratings-based measures
of complex human performance using a very few well trained
raters or by averaging across a larger number of less well
trained raters if all of them rated all subjects under
controlled circumstances. What the current state-of-the-art
does not provide is a useful way to extract reliable, valid
ratings from the kind of dirty and incomplete data sets
ordinarily available. Dirty rating data are produced by lack
of control which permits extraneous factors to influence the
ratings given. Such things as inadequate rater training,
poorly validated rating procedures and forms, and
variability in conditions under which performance is sampled
all tend to produce dirty data. Incomplete data sets are
those in which not all raters rate all subjects.

Any significant steps toward the resolution of this
problem would have immediate beneficial effects in the
practical evaluation of complex performance in ordinary
settings and in research in which complex performance is a
variable of interest.



Some Factors Affecting Reliability and Validity

No measurement, whether a test score or rating, may be
more valid than it is reliable. Reliability sets an upper
limit on the potential validity of the measure. Neither
individual items nor individual raters are perfectly
reliable measurement instruments in the sense of being
completely accurate, stable, and consistent. In classical
test theory and traditional practice, an individual item's
reliability is measured by either its mean correlation with
all other items on the test or its correlation with the
total test score. Both of these give essentially the same
result and are equivalent to the test item's expected
correlation with another randomly chosen single item from
the same content domain. Depending upon the calculational
procedure used, an individual item reliability may be called
a correlation of some kind or a discrimination index.
Similarly, the reliability of the ratings given by a single
judge is equal to the expected value of the correlation
between this judge's ratings and the ratings of another
independent, randomly chosen qualified judge. Two
strategies, separately or in combination, may be used to
improve the reliability of either a rating or a test score:
use more or use better.
Spearman-Brown's test reliability formula was first
developed to provide an estimate of how much the reliability
of a test's total score would be changed by adding or
deleting test items. Remmers, Shock, and Kelly (1927)
demonstrated that pooling (i.e., summing or averaging)
ratings across raters (where all or representative subsets
of raters rate all subjects) had the same effect as pooling
the item scores on an objective test. This means
Spearman-Brown's formula is equally applicable to both items
on tests and ratings provided by independent raters. Figure
1 depicts the relationship defined by Spearman-Brown's
formula between the reliability of the total score, the
reliability of individual item scores or ratings, and the
total number of independent items or ratings pooled (i.e.,
summed or averaged) together.
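The relationship described above can be sketched in a few
lines of Python. This is a minimal illustration of the
standard Spearman-Brown formula; the function name and the
single-observation reliability of 0.50 are hypothetical
choices made only for the example.

```python
def spearman_brown(single_reliability, n_pooled):
    """Reliability of the sum or mean of n_pooled independent,
    equally reliable items or ratings (Spearman-Brown formula)."""
    r = single_reliability
    return n_pooled * r / (1.0 + (n_pooled - 1.0) * r)


# Illustrative value only: one observation with reliability 0.50.
for n in (1, 2, 4, 8, 16):
    print(n, round(spearman_brown(0.50, n), 3))
```

The same function serves for test items and for raters,
which is the point of the equivalence noted above.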

Figure 1. Reliability as a Function of Observations:
Items or Ratings
(Horizontal axis: number of independent observations.)

Under ordinary "real world" circumstances most ratings
are obtained where many or all of the following conditions
prevail: (a) raters have had no systematic training in
rating based upon the use of standard stimuli and corrective
feedback; (b) raters receive no or little information
regarding how other raters rate the same subject under
equivalent circumstances; (c) the scales used are vaguely
defined as are the meanings of the individual point values
or scale categories; (d) different raters do not observe the
same performance under the same conditions; (e) not all
subjects are rated by all raters, frequently none of the
subjects are rated by all raters; (f) not all raters rate
all subjects, frequently no rater rates all subjects; and
(g) subsets of raters are not representative of the rater
pool. On the basis of empirical evidence, Symonds (1931)
concluded that under these kinds of ordinary circumstances
the correlation between independent pairs of raters (i.e.,
the reliability of a single rater) is typically around
r=0.55. The region between the upper two lines in Figure 1
approximates the reliability of pooled ratings as a function
of the number of independent ratings under typical
conditions (assuming all or representative subsets of raters
rate each subject). Clearly one way to improve the overall
reliability of either a rating or a test score is to base it
upon more ratings or test items.

Alternatively, the reliability of each individual test
item or rating may be improved. In testing practice, this
is accomplished by selecting only those individual items
which have had reliabilities above a specified value when
used in earlier administrations of the test. Nunnally
(1967) suggests a minimum individual item reliability of
between r=0.10 and r=0.20. When this rule is used on the
typical classroom objective test, the mean individual item
reliability generally falls between r=0.20 and r=0.30. The
lower two curves in Figure 1 define the region of expected
total test score reliability as a function of (a) typical
average item reliability and (b) the number of preselected
items the test contains. Selecting the most reliable raters
may occasionally be helpful; but, under typical
circumstances more is gained from pooling across all
available rating data rather than discarding the least
reliable and pooling the remainder.
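To make the comparison between raters and items concrete,
the Spearman-Brown relationship can be solved for the number
of pooled observations needed to reach a given total
reliability. The sketch below is illustrative only: the
function name and the target reliability of 0.90 are
assumptions, while the single-observation values (0.55 for a
typical rater, 0.20 and 0.30 for typical items) are the ones
quoted in the text.

```python
def observations_needed(single_reliability, target_reliability):
    """Number of independent observations whose pooled score
    reaches target_reliability (Spearman-Brown solved for n)."""
    r, target = single_reliability, target_reliability
    if not (0.0 < r < 1.0 and 0.0 < target < 1.0):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    return target * (1.0 - r) / (r * (1.0 - target))


# Target of 0.90 is an assumed, illustrative standard.
for label, r in (("typical rater", 0.55),
                 ("item, r=0.30", 0.30),
                 ("item, r=0.20", 0.20)):
    print(label, round(observations_needed(r, 0.90), 1))
```

Under these assumptions a typical rater contributes far more
per observation than a typical test item, so considerably
fewer raters than items are needed for the same total
reliability.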

Efforts to improve the reliability of individual rater
judgements (and thereby the reliability of the individual
rater) are generally directed towards eliminating the
conditions (described above) under which ratings tend to be
made in real world settings. Frequently, they rely upon
techniques such as improving the precision of the
definitions of the attribute to be rated and values on the
scale. Often this is implemented in the form of a
behaviorally anchored rating (BAR) scale (Smith and Kendall,
1963; Landy and Barnes, 1979). However, when BAR scales are
used in otherwise typical rating circumstances there is a
dearth of data indicating any improvement over non-BAR
scales. For example, Davidge, Davis, and Hull (1980; also
in Dielman, Hull, and Davis, 1980) report full scale
interrater reliabilities for individual house officers
(residents) of r=0.61 and for individual attending (faculty)
physicians of r=0.41. Davidge et al.'s results are for the
use of a very carefully designed BAR scale for measuring
medical students' clinical performance. The reliabilities
straddle Symonds's (1931) value for reliabilities obtained
under typical (non-BARS) rating conditions. We obtained a
mean interrater reliability for an individual rater of
r=0.50 across attendings and residents at two training sites
who used a non-BAR scale inventory to rate the clinical
performance of junior year medical students (Cason and
Cason, 1979). In the same paper (Cason and Cason, 1979), we
concluded that in most of the published literature on rating
health care professionals' clinical performance, the single
factor most influencing the reported reliability of the
total rating was the number of independent raters across
whom it was summed or averaged.

A BAR scale used in conjunction with rigorous rater
training can improve rater reliability over the value of
r=0.55 reported by Symonds (1931) for typical rating
circumstances. Stillman (1980) has achieved interrater
reliabilities of r=0.85 and intra-rater reliabilities of
r=0.90. Stillman obtained these results using the
behaviorally anchored, empirically validated Arizona
Clinical Interview Rating Scale in conjunction with rater
training. The rater training was based upon use of standard
stimuli (video tapes of interviews) and informative feedback
to the rater. The program has proved successful in training
raters belonging to three distinct groups: physicians,
nurse practitioners, and "programmed patients". Stillman's
results are directly attributable to her program's success
in eliminating many of the conditions found in typical
rating settings. While there are obvious practical
obstacles to emulating Stillman's approach, her results
provide a good benchmark for what can be accomplished (at
least in some settings) when sufficient interest, skill, and
resources are available.

It has long been acknowledged in both the folklore and
research literature relating to rating that raters may vary
in their general tendency to be stringent or lenient. This
variation can affect reliability. Ebel (1951) has suggested
two ways of applying Snedecor's (1946) (intraclass)
reliability formula depending on whether variations in rater
leniency could affect the stability of subjects' mean
(across raters) ratings. The first method applies when all
raters rate all subjects. When there is variation in rater
leniency, the first method yields a higher value than does
the second method. This first method ignores any
differences between the means of ratings given by different
raters in the same way as does an ordinary (Pearson
product-moment) correlation coefficient. For example, if
rater A assigned

     2, 6, 5, 3, 4

to five subjects in succession, and rater B assigned

     5, 9, 8, 6, 7

to the same five subjects rated in the same order, the
correlation between the ratings is r=1.00. Yet, rater B is
systematically more lenient than rater A. When all (or
representative subsets of) raters rate all subjects there is
no systematic effect of rater leniency on individual
subjects' mean ratings. By contrast, when subjects are
rated by different (non-representative) subsets of raters,
the mean of the observed ratings on each subject is a less
accurate measure of the subject's performance because some
subjects are rated by a more lenient group of raters than
are other subjects. The second method for estimating
interrater reliability suggested by Ebel, unlike an ordinary
correlation coefficient, takes into account differences in
rater leniency and thus yields a smaller and more
appropriate reliability value.
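The contrast between the two methods can be illustrated
numerically. The sketch below is not Ebel's own computation;
it uses the familiar analysis-of-variance form of the
intraclass correlation for a single rater, with the error
term either excluding or including between-rater
differences, which is our assumption about how the two
methods correspond. Rater A's ratings are 2, 6, 5, 3, 4 and
rater B's are taken to be exactly three points higher, an
assumption consistent with the reported r=1.00. All function
names are hypothetical.

```python
from statistics import mean


def pearson_r(x, y):
    """Ordinary product-moment correlation between two raters."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def single_rater_reliability(ratings, include_rater_leniency):
    """Intraclass reliability of one rater from complete data.

    ratings[i][j] is the rating of subject i by rater j.
    If include_rater_leniency is True, differences between rater
    means are counted as error; otherwise they are removed.
    """
    n, k = len(ratings), len(ratings[0])
    grand = mean(v for row in ratings for v in row)
    subject_means = [mean(row) for row in ratings]
    rater_means = [mean(row[j] for row in ratings) for j in range(k)]
    ss_subjects = k * sum((m - grand) ** 2 for m in subject_means)
    ss_raters = n * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_residual = ss_total - ss_subjects - ss_raters
    ms_subjects = ss_subjects / (n - 1)
    if include_rater_leniency:
        ms_error = (ss_raters + ss_residual) / (n * (k - 1))
    else:
        ms_error = ss_residual / ((n - 1) * (k - 1))
    return (ms_subjects - ms_error) / (ms_subjects + (k - 1) * ms_error)


rater_a = [2, 6, 5, 3, 4]
rater_b = [a + 3 for a in rater_a]      # assumed: B is 3 points more lenient
table = [list(pair) for pair in zip(rater_a, rater_b)]

print(pearson_r(rater_a, rater_b))
print(single_rater_reliability(table, include_rater_leniency=False))
print(single_rater_reliability(table, include_rater_leniency=True))
```

With these data the correlation and the leniency-ignoring
coefficient are both perfect, while the coefficient that
counts leniency as error is close to zero.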

There is no shortage of evidence that different
categories of health professionals vary in their leniency
when called upon to rate the same performance under ordinary
(i.e., poorly controlled) conditions. For example, ratings
of junior medical students by residents (house staff) have
been consistently and widely reported to be more lenient
than are those given by faculty (attending) physicians
(Printen, Chappell, and Whitney, 1973; O'Donohue and Wergin,
1978; Pierleoni, Clark, and Dudding, 1979; Cason and Cason,
1979; Dielman, Hull, and Davis, 1980). The same studies
also indicate the presence of variation in the leniency of
raters in the same category.

Exemplary programs such as Stillman's can sometimes
reduce variations in rater leniency to the point where it is
no longer of practical importance as a source of inaccuracy
in ratings (Stillman, Brown, Redfield, and Sabers, 1977;
Sabers, 1981: personal communication). Nevertheless, when
Meskauskas and Norcini (1980) discuss the problem of
variability in rater leniency, in both standards setting and
rating performance, they suggest the need to go beyond the
things found in programs such as Stillman's. Meskauskas and
Norcini suggest that in both standards setting and
performance rating judges' ratings should be "handicapped"
(i.e., corrected or adjusted) for variation in the judges'
leniency by applying methods presented by Stanley (1961).
Meskauskas and Norcini appear to be implying that it is at
least difficult if not impossible to reduce rater leniency
variation below the level of practical concern entirely
through the use of BAR scales in conjunction with rater
training.
Stanley's (1961) methods allow one to both determine
the extent of variation in rater leniency and develop
correction formulas for each rater. Stanley's
analysis-of-variance related procedures allow the
determination of the separate contribution of rater leniency
and subject performance to the variation in the observed
rating data. However, Stanley's procedures may be applied
only when all raters have rated all subjects, i.e., to data
sets with no missing data. But, as Stanley points out (and
as was implied above in discussing Ebel's procedures) if all
raters have rated all subjects, there is no need for
adjusting the ratings. When all raters have rated all
subjects, the mean or sum of the raw ratings on any subject
is as valid and reliable as can be produced by any
adjustment for rater leniency. Although correction formulas
for raters developed at one time (when all raters rated all
subjects) might be used later when subjects were rated by
only (potentially non-representative) subsets of raters,
this would be defensible only after it had been demonstrated
that individual raters' relative leniency remained stable
over time.
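The argument in this paragraph can be checked with a small
numerical sketch. It does not reproduce Stanley's
analysis-of-variance procedures; it assumes the simplest
possible correction, an additive offset equal to each
rater's mean minus the grand mean, estimated from a complete
data set. The data and all names are hypothetical.

```python
from statistics import mean


def subject_means(ratings):
    """ratings maps (subject, rater) -> rating; returns subject -> mean."""
    by_subject = {}
    for (subject, _rater), value in ratings.items():
        by_subject.setdefault(subject, []).append(value)
    return {s: mean(v) for s, v in by_subject.items()}


def leniency_offsets(complete_ratings):
    """Each rater's mean rating minus the grand mean (complete data only)."""
    grand = mean(complete_ratings.values())
    by_rater = {}
    for (_subject, rater), value in complete_ratings.items():
        by_rater.setdefault(rater, []).append(value)
    return {r: mean(v) - grand for r, v in by_rater.items()}


def adjust(ratings, offsets):
    """Subtract each rater's leniency offset from that rater's ratings."""
    return {(s, r): v - offsets[r] for (s, r), v in ratings.items()}


# Hypothetical complete data: rater B is uniformly 3 points more lenient.
rater_a = {1: 2, 2: 6, 3: 5, 4: 3, 5: 4}
rater_b = {s: v + 3 for s, v in rater_a.items()}
complete = {(s, "A"): v for s, v in rater_a.items()}
complete.update({(s, "B"): v for s, v in rater_b.items()})

offsets = leniency_offsets(complete)
raw_complete = subject_means(complete)
adjusted_complete = subject_means(adjust(complete, offsets))

# Later, incomplete data: subject 1 seen only by B, subject 2 only by A.
incomplete = {(1, "B"): rater_b[1], (2, "A"): rater_a[2]}
raw_incomplete = subject_means(incomplete)
adjusted_incomplete = subject_means(adjust(incomplete, offsets))

print(raw_complete == adjusted_complete)
print(raw_incomplete, adjusted_incomplete)
```

With complete data the adjustment leaves every subject's
mean unchanged; with incomplete data it matters, but only to
the extent that the offsets estimated earlier still describe
the raters.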

In summary, if one desires to obtain a highly reliable
and valid assessment of a complex human performance based
upon ratings from human judges, the current
state-of-the-art, as suggested in the literature reviewed
above, indicates that a model assessment program would
include: (a) carefully trained raters; (b) empirically
validated, behaviorally anchored scales; (c) controlled,
uniform conditions under which performance is observed and
rated; (d) multiple raters for each subject; (e) all raters
(or representative subsets of raters) rate all subjects; and
(f) use of the mean rating (across raters) obtained by a
subject as the best available measure of the subject's true
performance. In actual settings most of these conditions
are hard to satisfy. Having more raters per subject (d) can
be used to offset shortcomings in conditions "a" through "c"
but only if condition "e" is satisfied. Otherwise
variations in rater leniency will lower the reliability and
validity of the outcome. However, in practice, condition "e"
is frequently not satisfied.

Although the theory we set forth below was neither
derived from nor motivated by the applications of
latent-trait theory to objective testing, we have
discovered, with the benefit of hindsight, that our theory
is most easily grasped by someone already familiar with the
general schema of latent-trait theory as applied to
objective testing. Consonant with the expository strategy
used above, we have chosen to begin with the more familiar
ground of testing, then go on to our theory of performance
rating.
Latent-Trait Theory

Latent-trait test score theory (Lord, 1952, 1953;
Baker, 1977; Hambleton, Swaminathan, Cook, Eignor, and
Gifford, 1978) proposes to account for the score on an
individual test item of an individual person. In the
theory's simplest form, the probability that the person will
answer an item correctly is determined by two factors: the
person's true ability and the item's intrinsic difficulty.
Item difficulty and person ability are both assumed to
reflect the operation of some underlying (i.e., not directly
observable, therefore latent) trait, attribute, or factor;
for example, the attribute of knowledge. A person with much
knowledge would be located high on the latent knowledge
scale. Similarly, an item requiring great knowledge to be
correctly answered would be located high on the knowledge
scale. The probability that a person of a given ability
will correctly answer an item of a given difficulty is
defined by an "s-shaped" item characteristic curve. Figure
2 gives hypothetical characteristic curves for items A and
B.

Figure 2. Characteristic Curves
(Horizontal axis: ability, with points K and L marked.)

By convention, Item A is said to have difficulty K or to
be located at point K on the latent scale. A person with
ability K (i.e., located at point K) has a 0.50 probability
of correctly answering item A. The item characteristic
("s-shaped") curve defines the exact relationship between a
person's ability (at any point on the latent scale) and that
person's probability of correctly answering that item.
Consider Figure 2: a person of ability K has a near zero
probability of answering item B correctly. While a person
with ability L has near a 1.0 probability of answering Item A
correctly, this person's probability of answering item B
correctly is only 0.50.

Probably the greatest number of latent-trait theory
applications have been based upon the Rasch (1966)
measurement model. This may be largely attributed to the
work of Ben Wright and his colleagues (Wright, 1968; Wright
and Stone, 1979; Mead, Wright, and Bell, 1979), such as their
development of techniques, including computer programs,
which make Rasch analysis easier, as well as their zealous
advocacy of Rasch measurement techniques. The defining
characteristics of the Rasch model are (a) only one
parameter, location on the latent trait, is used to
characterize each person or item; and (b) the "s-shaped"
item characteristic curve is operationally defined by the
logistic function. Other models of latent-trait test theory
include additional factors (e.g., item discrimination, a
guessing factor, and so forth) in their characterization of
test items and people and/or define the characteristic curve
using a different mathematical function, e.g., the normal
ogive.
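The two characteristic-curve definitions mentioned here can
be written down directly. The sketch below gives the
one-parameter logistic (Rasch) curve and, for comparison, a
one-parameter normal ogive; the function names and the
numeric locations chosen for points K and L are illustrative
assumptions, not values from any data set.

```python
import math
from statistics import NormalDist


def rasch_probability(ability, difficulty):
    """Logistic item characteristic curve with one parameter per
    person (ability) and one per item (difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def normal_ogive_probability(ability, difficulty):
    """Same idea with the cumulative normal curve in place of the logistic."""
    return NormalDist().cdf(ability - difficulty)


# Hypothetical locations on the latent scale for points K and L.
K, L = 0.0, 5.0
print(rasch_probability(K, K))   # person at K, item A located at K
print(rasch_probability(K, L))   # person at K, item B located at L
print(rasch_probability(L, K))   # person at L, item A
print(rasch_probability(L, L))   # person at L, item B
```

Both curves give a probability of 0.50 when ability equals
difficulty and approach 0 or 1 as the two move apart; they
differ only in the exact shape of the "s".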

Irrespective of what particular model of latent-trait
theory is used, the usefulness of the model rests upon the
(testable) assumption of parameter invariance. In contrast
to conventional test item statistics (e.g., difficulty index
and discrimination index) and norm-referenced test scores,
the parameter values for item difficulty and person ability
are independent of the context of both the particular group
of people who took the test and the particular set of items
in the test. This may be most clearly explained by analogy
to the physical measurement of temperature in the days when
chemists (or alchemists) made their own thermometers.

In Figure 3, the horizontal lines T1, T2, and T3 are
thermometers. The letters "A" through "O" represent
specific observed melting and boiling points for various
materials, e.g., alcohol, water, paraffin, lead, and so
forth. Note that T1 and T2 share points "B" and "D". T2
and T3 share points "I" and "M". But T1 and T3 share no
observed points in common. No matter how the individual
thermometers were originally graduated or where their
arbitrary zero points were placed, the relative positions
(ordering) and distances between observed melting and
boiling points would remain the same. Thus, the
observations that are in common to two thermometers can be
used to calibrate the measurements on one thermometer
against the other. Because T1 and T3 are linked through
common observed points on T2, the information on all these
instruments can be placed on a single temperature scale
running from "A" to "O". The location (parameter) of a
melting point of one material is invariant with respect to
the relative locations (ordering and distances) of other
melting points.
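The calibration argument can be sketched in code. The
example assumes that each home-made thermometer is a linear
function of the same underlying temperature, so that two
shared points are enough to link one instrument to another.
The latent positions, the graduations of the three
thermometers, and the function names are all hypothetical.

```python
def linear_link(source_pair, target_pair):
    """Slope and intercept mapping source readings onto the target
    scale, given the readings of the same two shared points."""
    (s1, s2), (t1, t2) = source_pair, target_pair
    slope = (t2 - t1) / (s2 - s1)
    return slope, t1 - slope * s1


def convert(reading, link):
    slope, intercept = link
    return slope * reading + intercept


# Hypothetical latent positions for a few of the lettered points.
latent = {"B": 3.0, "D": 9.0, "G": 19.0, "I": 26.0, "M": 39.0, "O": 44.0}

# Hypothetical graduations: each thermometer has its own zero and unit.
t1 = {p: 2.0 * latent[p] + 10.0 for p in ("B", "D", "G")}
t2 = {p: 0.5 * latent[p] - 1.0 for p in ("B", "D", "I", "M")}
t3 = {p: 3.0 * latent[p] + 100.0 for p in ("I", "M", "O")}

t1_to_t2 = linear_link((t1["B"], t1["D"]), (t2["B"], t2["D"]))
t3_to_t2 = linear_link((t3["I"], t3["M"]), (t2["I"], t2["M"]))

g_on_t2 = convert(t1["G"], t1_to_t2)   # G was observed only on T1
o_on_t2 = convert(t3["O"], t3_to_t2)   # O was observed only on T3
print(g_on_t2, o_on_t2)
```

Although G and O were never observed on the same instrument,
both end up on the T2 scale, where their order and distance
can be compared directly.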



Figure 3. Invariance of Parameter Locations:
Ordering and Relative Inter-point Distances

T1:  A..B.....D.........G

T2:     B.C...D................I....J.......M.N

T3:                F...........I.........L..M....O

     A..B.C...D..E.F....G..H...I....J..K.L..M.N..O

                   Latent Attribute



Latent-trait analysis of the responses of a group of
people to a group of items on a test produces estimates of
their locations (i.e., true ability of persons, intrinsic
difficulty of items) on an underlying trait. Figure 3 can
be used to represent different objective tests (i.e., T1,
T2, and T3) with the letters being either items or people or
both. When this is done, one can make very concrete
predictions about a person's performance on items to which
that person has not previously responded. Also, the results
of a test composed of any combination of the items whose
locations are represented by the letters "A" through "O"
could be translated into equivalent scores for tests T1, T2,
and T3 because all the items can be calibrated against each
other. This is all possible because, like melting points,
the locations (difficulties) of items on the latent trait are
invariant with respect to their ordering and inter-item
distances. Likewise, relative positions of person abilities
are invariant with respect to other persons' abilities and
item locations. By contrast, conventional item statistics
reflect only the relationship between a particular group of
examinees (or a similar group) and the particular items on
the test. For example, an item's difficulty index (unlike
the item's intrinsic difficulty) simply indicates the
proportion of examinees that correctly answered it, or would
be expected to correctly answer it in a comparable group of
examinees. The conventional item discrimination index is
similarly limited in meaning and usefulness. Parameter
invariance is the characteristic of latent-trait models
which makes them uniquely useful.
Not surprisingly, many Rasch applications are designed
to capitalize upon parameter invariance to generate
equivalent tests composed of different items or equate the
results of one test with another having overlapping items.
This is clearly illustrated by Anderson, Baker, Laguna, and
Laguna's (1980) use of the Rasch model to obtain comparable
test scores based on overlapping but not identical sets of
test items in Neurology clerkship examinations. Anderson et
al.'s work is exceptional in that it involved an application
of the Rasch model to classroom level data sets containing
only 7 to 10 students per exam. More commonly, Rasch
techniques are applied when the number of persons who have
responded to the items is 200 or more. The uncertainty
(measurement error) associated with an item's difficulty
tends to be much bigger than that associated with a melting
point; in practical work it is not unusual for a small
percent of the items in a given test to not fit the Rasch
model. These are identified and discarded so that they do
not adversely affect the estimation of the intrinsic
difficulties of the remaining items.

Anderson et al. cite several Rasch applications in
health professions education including: a pharmacy
externship (Saltti and Kiteri, 1980), analysis of the Medical
College Admission Test sub-part scores (Cromier, 1977), and
analyses of tests of the National Board of Medical Examiners
(Hughes, 1979; Kreines and Mead, 1979). Schumaker (1979)
applied the Rasch model to the problem of equating medical
examinations. Harasym (1981) used Rasch techniques in
comparing Nedelsky's (1954) and a modified form of Angoff's
(1971) procedures for setting passing standards for
objective tests.

Our Rudimentary Theory of Performance Rating

We propose that the rating obtained by a subject is a
function of the subject's achievement and the rater's
leniency and sensitivity. Neither achievement nor leniency
is directly observable, but each underlies and partially
accounts for observable behavior. Subject achievement
accounts for subject performance only in part. Factors such
as illness, inappropriate working conditions, action or
inaction of others (e.g., a hostile co-worker or examiner)
can either improve or reduce the quality of the observed
performance regardless of the subject's true level of
achievement. Similarly, the rater's leniency and
sensitivity account in part for the ratings given but the
ratings also reflect the performance that was observed and
rated.

Both rater leniency and subject achievement are
measured upon a scale of the same latent trait, factor or
attribute. Generically, this underlying trait is called an
ability and could be any skill, competency, or disposition,
whether innate or acquired. Leniency and achievement may
each be represented by points on this ability scale. These
points are called the rater reference point (RRP) and the
subject achievement point (SAP) respectively.

The rater reference point (RRP) is used by the rater as
an implicit standard for judging the perceived performance
of the subject. The location of the rater reference point
(RRP) embodies the rater's prior knowledge, understanding,
and beliefs regarding (a) fundamental, ideal standards
relevant to the trait at issue; (b) the subject (person)
whose performance or product is to be rated; (c) the task or
activity to be performed by the subject; (d) the constraints
imposed by the setting upon either or both the rater and
subject; (e) where problem solving (broadly construed) is
involved in the subject's task, the intrinsic difficulty of
the problem; and (f) related factors. The rater reference
point may be viewed as arising from an adjustment the rater
makes to some implicit, fundamental standard. The
fundamental standard is appropriate only to an ideal set of
rating circumstances, i.e., conditions under which nothing
but the standard and the performance need be considered in
determining the rating. The rater reference point (RRP)
results from the rater's effort to take all the
discrepancies between an ideal setting and the actual one
into account prior to assessing the subject's performance.
The rater reference point (RRP) embodies all factors which
systematically influence the rating assigned except the
subject's performance and effects related to the rater's
resolving power and sensitivity.

Implicitly, the rater perceives the subject's
performance as a deviation on the relevant ability scale
from the rater's RRP. The size and direction (above or
below the RRP) of this deviation directly equal the distance
from the RRP to the subject achievement point (SAP) on the
ability scale, as judged by this rater. The rating assigned
is a function of the difference between RRP and SAP.

The rater's resolving power, i.e., the precision of the
rater's judgements as embodied in the assigned ratings, is
greatest when the difference between RRP and SAP is minimum.
Resolving power diminishes in an accelerated manner as the
difference between RRP and SAP increases. Generally, small
differences in value for SAPs near the RRP result in
substantially different assigned ratings. As distance from
the RRP increases, larger and larger differences between two
SAPs must be present for there to be an appreciable
difference in the corresponding assigned ratings. These
relationships are analogous but not equivalent to those of
visual resolving power. Objects close to the observer need
not be separated from each other by very much to be seen as
distinctly not at the same distance. But as distance from
the observer increases, the distance between objects must
increase if they are to be recognized as being at different
distances from the observer. Because resolving power
diminishes in an accelerated manner as distance from RRP to
SAP increases, the rater characteristic curve (RCC), which
specifies the rating assigned as a function of the
difference between RRP and SAP, is one of a family of
smooth, continuous, "s-shaped" curves. (A member of this
family of curves is commonly called an ogive, e.g., the
normal ogive.)

Some raters have greater sensitivity than do other
raters. Variation in sensitivity between raters is defined
by differences in the rate of acceleration in change of
resolving power. However, rater sensitivity is somewhat
more easily grasped intuitively in terms of the difference
in subject achievement associated with a given pair of
ratings, for example 10% (of possible points) and 90%. A
highly sensitive rater would give these ratings when there
was a relatively small difference in two subjects'
achievement. A less sensitive rater would give these
ratings when there was a relatively much larger difference
in the achievement of the two subjects. The limit of
hypersensitivity is characterized by a rater that gives only
minimum or maximum ratings. Any SAP less than the
hypersensitive rater's RRP receives a rating of 0%; any SAP
equal to or above this rater's RRP receives a rating of
100%. Graphically, the hypersensitive rater's
characteristic curve (RCC) is no longer a continuous, smooth
curve. It has become two horizontal lines, one at 0%
extending down the ability scale from the RRP; the other at
100% extending from the RRP up the ability scale. By
contrast, the limit of hypo-sensitivity is characterized by
a rater who assigns all SAP's the same value as if they were
no different from this rater's RRP. Graphically, the
hypo-sensitive rater's characteristic curve has become a
horizontal line extending indefinitely in each direction
from the RRP parallel to the ability scale at the rating
level associated with this rater's RRP.

The measure of rater sensitivity is the slope of the
RCC at the point on the RCC directly above the RRP on the
ability scale. The hypersensitive limit is defined by the
value of the slope having become indefinitely large. The
hypo-sensitive limit is defined by an RCC slope of zero.
Neither limit occurs in practice, though they may be
approached.
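
As a rough numerical illustration of the two limits, the
sketch below assumes that the RCC is a normal ogive scaled to
percent (the form adopted for the simplified model later in
this paper) and uses a hypothetical parameter, spread, to
stand in for sensitivity. Under that assumption the slope
directly above the RRP is 100 times the unit-normal density
at zero, divided by the spread. The function names are
illustrative only and are not part of the theory as stated.

```python
import math


def rcc(sap, rrp, spread):
    """Rating in percent from a normal-ogive RCC (assumed form)."""
    z = (sap - rrp) / spread
    return 100.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def slope_at_rrp(spread):
    """Slope of the assumed RCC directly above the RRP (percent per unit)."""
    return 100.0 / (spread * math.sqrt(2.0 * math.pi))


# A small spread approaches the hypersensitive limit (a step at the RRP);
# a very large spread approaches the hypo-sensitive limit (a flat line).
for spread in (1.0, 100.0, 10000.0):
    below = rcc(490.0, 500.0, spread)
    above = rcc(510.0, 500.0, spread)
    print(spread, round(slope_at_rrp(spread), 4), round(below, 2), round(above, 2))
```

As the spread shrinks the two ratings move toward 0% and
100%; as it grows both ratings move toward the 50% assigned
at the RRP itself.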

The theory of performance rating proposed above may be
understood by analogy to latent-trait test theory. Instead
of locating test items and examinees (persons), the proposed
theory locates raters (persons) and subjects (persons or
products) on an underlying trait. Item difficulty is
replaced by rater leniency; probability of answering
correctly is replaced by rating points assigned; and item
discrimination by rater sensitivity. Reconsidering Figure
2, A and B are rater characteristic curves (RCC). Rater A
has a leniency of K (i.e., rater A's RRP is located at K).
Rater B has a leniency of L. A subject with an achievement
point (SAP) located at L would receive a rating of 50% from
rater B; and, a rating of near 100% from rater A.

As proposed, our theory is only rudimentary. Many
things potentially characterizable as separate factors have
been subsumed into the construct of rater leniency. For
example, "cases", "problems", and "settings" (i.e., things
with which the subject must contend) might be represented as
a separate construct. Then we might be able to separate the
components of rater leniency regarding the rater's
estimation of task demands from the rater's leniency in
assigning ratings when task demands do not influence the
location of the rater's RRP. An analog to the "guessing
parameter" sometimes used in latent-trait test theory might
be the presumption of a "minimum existing competence". This
would function to limit the minimum rating a rater would
assign regardless of how poor the observed performance was.
Elaborations such as these hardly seemed justified to us
until some empirical tests of the more rudimentary version
had been completed.

Simplifying Assumptions 

To facilitate our initial empirical investigations we
imposed the following simplifying assumptions upon the
rudimentary theory presented above:

1. All raters have equal sensitivity. Under this
condition the slope of the rater characteristic curve is no
longer a measure of rater sensitivity, not even mean rater
sensitivity. Any convenient unit (graduation) of
measurement may be chosen for the ability scale. Even
though a different size unit produces a different value for
the slope, this does not imply a change in sensitivity
because the relative distances among raters and subjects
remain constant. When equal sensitivity is assumed,
sensitivity becomes perfectly confounded with leniency and
ability.

2. The rater characteristic curve evaluates the
difference between a rater reference point (RRP) and subject
achievement point (SAP) as the percent (%) of possible
rating points.

3. The rater reference point (RRP) for any rater is
located under that rater's characteristic curve (RCC) on the
ability scale at that point which evaluates to a rating of
50%. This appears to represent a potentially large and
strongly counter-intuitive departure from the construct of
the RRP as presented in the proposed theory. Intuitively it
might seem that in typical rating circumstances a rater's
reference point would be near some traditionally significant
value, e.g., 75%. This arises in part from considering the
RRP as if it were equivalent to the ostensible, conscious
standards in common use. A careful examination of the
definition of the RRP given above suggests that its
relationship to such conscious standards may be very remote
and complex. At any rate, we judged that the gains in
mathematical and conceptual tractability had from imposing
this assumption justified its use, at least during our
initial empirical investigations.

Our Simplified Performance Rating Model

More formally, we propose that the ability scale upon
which rater reference points (RRP) and subject achievement
points (SAP) are located is an equal interval scale of
arbitrary graduation (unit) and arbitrary origin (zero
point). For the purposes of this research, we operationally
define the rater characteristic curve (RCC) as the product
of an arbitrary positive, constant scaling factor (SF) and
the cumulative unit-normal deviate ogive. The scaling
factor is arbitrarily set equal to 100. The difference
between a rater reference point (RRP) and subject
achievement point (SAP) divided by the scaling factor (SF)
gives an ability scale deviation value (z):

Formula 1

z = (SAP - RRP)/SF

The proportion of possible rating points assigned for a
given value of z is equal to the total proportion of area
under the unit-normal curve below z, that is p(z).
Multiplying the proportion p(z) by 100 gives the expected
subject rating (ESR) in percent units:

Formula 2

ESR = p(z) x 100
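
Formulas 1 and 2 can be sketched in a few lines. The
function names below are illustrative and do not come from
the original program; the unit-normal area p(z) is computed
from the error function in the standard library.

```python
import math

SF = 100.0  # scaling factor, arbitrarily set equal to 100 in the text


def unit_normal_area_below(z):
    """p(z): proportion of area under the unit-normal curve below z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_subject_rating(sap, rrp, sf=SF):
    """Expected subject rating (ESR) in percent units."""
    z = (sap - rrp) / sf                       # Formula 1
    return unit_normal_area_below(z) * 100.0   # Formula 2


print(expected_subject_rating(500.0, 500.0))            # SAP located at the RRP
print(round(expected_subject_rating(600.0, 500.0), 2))  # SAP one SF above the RRP
print(round(expected_subject_rating(400.0, 500.0), 2))  # SAP one SF below the RRP
```

A subject located exactly at a rater's RRP is expected to
receive 50%, and the curve is symmetric about that point.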

The relationship between the expected subject rating (ESR) 
and the discrepancy between RRP and SAP is depicted 
graphically in Figure 4.

Figure 4. Expected Rating as a Function of Distance Between
RRP and SAP (When Scaling Factor (SF) = 100). Horizontal
axis: distance from rater point to subject point.



There may be variation in the rater's perception,
knowledge, judgement, and so forth. Therefore, the observed
subject rating (OSR) may contain error:

Formula 3

OSR = ESR + error

In Hambleton, Swaminathan, Cook, Eignor, and Gifford's
(1978) terms, our model is somewhere between Lord's (1952;
1953) two parameter normal ogive model and Rasch's (1966)
one parameter logistic model. Conceptually it is somewhat
closer to Rasch's model, although it uses the normal ogive
as does Lord's. It was not until our model was developed
essentially to the level presented above that we somewhat
belatedly recognized some of its conceptual and formal
relationships to Rasch's and Lord's objective test
measurement models.

Objectives

The objectives of the research reported here were to
determine the extent to which a normal-ogive model of a
proposed latent-trait theory of performance rating: (a) fit
data of a type common to health professions education, i.e.,
dirty and incomplete ratings of clinical performance; (b)
clarified and quantified the separate contribution of (1)
all rater characteristics as embodied in the single
theoretical construct of leniency and (2) the construct of
the subject's underlying (i.e., latent) true achievement to
the observed dirty and incomplete ratings; and, (c) appeared
to provide a basis for generating more reliable and valid
measures of performance than the mean of the observed
ratings on a subject when the rating data are not only dirty
and incomplete but the subsets of raters are
unrepresentative of the whole relevant rater pool.

Method 

Data Source. Data analyzed were samples of convenience
available from a project whose objective was to develop a
machine based system for processing clinical performance
data. As part of that project, a prototype machine readable
(optically scanned) form was used experimentally (Cason and
Cason, 1979). Data collected on this experimental form were
analyzed here.

Subjects and Cohorts. The subjects upon whom rating
data were available were third year medical students
enrolled in a medicine clerkship, i.e., a clinically
oriented course in internal medicine. Data were available
from the third and fourth cohorts (i.e., groups of students
concurrently taking the course) in academic year 1978-79 and
the second cohort in 1979-80. The third cohort took the
course during the winter months; the fourth during the
spring; and, the second during the fall. Table 2 gives the
number of students in each cohort.

Clerkship and Setting. The medicine clerkship was 12
weeks long with six weeks spent at each of two training
sites: University Hospital and Little Rock Veterans
Administration Hospital. In the wards, instruction was
entirely tutorial and small group based. Faculty attending
physicians and residents each had a small number of (usually
at least two but fewer than six) medical students randomly
assigned to them for instruction. Residents tended to have
more contact with students than did the faculty.

Rating Instrument. The machine processable form
contained a 33 item clinical performance rating inventory.
The items were divided into seven non-overlapping
categories. Raters could assign a rating value of from 1 to
5 to each item or indicate that it was either not observed
or was not applicable. Rating values were defined in
explicitly norm-referenced terms rather than being
behaviorally anchored. For example, a rating of "4" was
defined as "A little better than the typical student in the
typical class (i.e., would be in the top 25% but below the
top 10%)". Appendix A contains a facsimile of the form.
For scores on the full inventory (i.e., mean of valid
ratings to all items), previous research (Cason and Cason,
1979) indicated a mean interrater correlation of r=0.50,
ranging from a high of r=0.71 between residents and faculty
at the same training site to a low of r=0.23 for ratings
given by residents at one site and faculty at another.

Raters and Rating Procedures. The raters were the
faculty attending physicians and residents who trained the
medical students. Most students were rated by two attending
physicians and one resident at University Hospital and by
one attending and one resident at the VA Hospital (mode=5
ratings/student). Raters received a 20 minute oral
explanation of the proper use of the rating form (from G.
Cason) and a written memorandum restating the details. No
other rater training was used. At the conclusion of the six
weeks students spent at a training site, raters completed a
form on each student with whom they had contact. Raters
entered only rating data. The various identification data
grids were completed by a departmental clerk. After the
forms were optically scanned and an electronic (computer
disk file) copy made, they were placed in the respective
students' permanent files. The number of raters for each
cohort is given in Table 2. The number of raters
overlapping cohorts (i.e., rating students in more than one
cohort) is given in Table 3.

Dependent Measure. The dependent measure of clinical
performance was operationally defined as the mean valid
rating across all items in the inventory, rated by one
rater, expressed in percent form. A valid rating was any
rating of 1 through 5. Blanks, multiple marks, not
applicable and not rated were non-valid ratings. Although
the inventory contained items of both the affective,
interpersonal skills type and the cognitive, technical,
problem solving type which prior research (e.g., Davis,
Hull, Davidge, and Dielman, 1979) indicated belong to
statistically independent (orthogonal) factors, the global
trait represented by the mean across all items, i.e.,
overall achievement in clinical performance, was chosen.
This was done because (a) with missing data at the item
level, unbiased estimates of the separate factor scores
could not be obtained with any certainty; (b) extracting
factor scores (by factor analysis) is a scaling procedure
which results in "cleaner" scores, thus results of further
analyses based upon these factor scores might be
contaminated by and attributed to the effects of the factor
analysis; and (c) the only available unbiased measure of
both student performance and rater judgement was the mean of
the valid ratings across all items on the inventory.
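
A minimal sketch of the dependent measure follows. The text
does not state how the 1 to 5 item ratings were converted to
percent form; the sketch assumes, purely for illustration,
that the mean valid rating is divided by the maximum rating
of 5. The function name and the use of None for non-valid
responses are hypothetical.

```python
def percent_score(item_ratings, max_rating=5):
    """Mean of the valid (1 through 5) ratings on one form, in percent.

    Assumption: percent form = 100 * mean / max_rating. The source does
    not give the conversion, so this scaling is illustrative only.
    """
    valid = [r for r in item_ratings if r in (1, 2, 3, 4, 5)]
    if not valid:
        return None  # no valid ratings on this form
    return 100.0 * (sum(valid) / len(valid)) / max_rating


# One hypothetical form: None marks blank, multiple-mark,
# not-applicable, or not-rated items.
form = [4, 3, None, 5, 4, None, 3]
print(percent_score(form))
```

Non-valid responses are simply left out of the mean, so a
form with many unobserved items still yields a score.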

Estimation of RRP's and SAP's. Program MERLIN (Cason,
1980) was used in conjunction with subroutine STEPIT
(Chandler, 1965) to obtain least-squares estimates of the
rater reference points (RRP) and subject achievement points
(SAP). Briefly, MERLIN operates as follows. An observed
data table with one row per subject and one column per rater
is input. All observed subject ratings (OSR) are contained
in this data table. A set of "best guesses" for the RRP's
and SAP's are input. In actual practice, we started with
very bad guesses: all RRP's and all SAP's equal to 500.
The program uses these starting guesses for the SAP's and
RRP's and the function depicted in Figure 4 to calculate an
expected subject rating (ESR) for every cell in an expected
data table. Then, the discrepancy between each value in the
observed data table and its corresponding value in the
expected data table is found and squared. When all the
squared values are summed, the result is the error
sum-of-squares (ESSQ) for the fit between the predicted
ratings generated from the current set of "guesses" for the
SAP's and RRP's and those ratings actually observed. STEPIT
is used to successively alter (i.e., step) the guesses for
the parameters and evaluate the impact on the resulting fit.
When changes to the parameter values no longer produce
appreciable improvement in the fit (reduction in the
error sum-of-squares) between the observed and predicted,
MERLIN outputs a series of reports. These reports include
the least-squares estimates of the RRP's and SAP's, the
complete table of predicted ratings, measures of final fit
(r and ESSQ), results of an F-test between the proposed
model and the null hypothesis, and so forth. This process
requires that one parameter be fixed (i.e., held at a
constant value throughout the estimation process) to anchor
the scale. A senior faculty member who rated at least 6
students in each of the cohorts was used for this. This
rater's RRP was held fixed at 500.
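
The estimation loop described above can be sketched as
follows. This is not MERLIN or STEPIT: a simple
coordinate-wise direct search with step halving is swapped
in for illustration, and the function names, step sizes, and
demonstration data are all hypothetical. What the sketch
does share with the description is the starting guess of 500
for every parameter, the error sum-of-squares computed over
observed cells only, and one rater's RRP held fixed at 500 to
anchor the scale.

```python
import math

SF = 100.0  # scaling factor used in the text


def esr(sap, rrp):
    """Expected subject rating in percent (Formulas 1 and 2)."""
    z = (sap - rrp) / SF
    return 50.0 * (1.0 + math.erf(z / math.sqrt(2.0)))


def essq(ratings, saps, rrps):
    """Error sum-of-squares over the observed cells only."""
    return sum((obs - esr(saps[s], rrps[r])) ** 2
               for (s, r), obs in ratings.items())


def fit(ratings, anchor_rater, start=500.0, step=50.0, min_step=0.01):
    """Coordinate-wise direct search; a stand-in, not the STEPIT algorithm."""
    saps = {s: start for s, _ in ratings}
    rrps = {r: start for _, r in ratings}
    best = essq(ratings, saps, rrps)
    while step > min_step:
        improved = False
        for table in (saps, rrps):
            for key in list(table):
                if table is rrps and key == anchor_rater:
                    continue  # held fixed to anchor the scale
                for delta in (step, -step):
                    old = table[key]
                    table[key] = old + delta
                    trial = essq(ratings, saps, rrps)
                    if trial < best:
                        best, improved = trial, True
                        break  # keep the improved value
                    table[key] = old  # undo the trial step
        if not improved:
            step /= 2.0  # no single step helped: refine the step size
    return saps, rrps, best


# Hypothetical, noise-free demonstration data: (subject, rater) -> rating.
true_saps = {"S1": 520.0, "S2": 580.0, "S3": 610.0, "S4": 470.0}
true_rrps = {"R1": 500.0, "R2": 450.0, "R3": 540.0}
ratings = {(s, r): esr(true_saps[s], true_rrps[r])
           for s in true_saps for r in true_rrps}

saps, rrps, best_essq = fit(ratings, anchor_rater="R1")
print(round(best_essq, 4))
```

Because the ratings are stored by (subject, rater) pair,
missing cells are simply absent and contribute nothing to
the error sum-of-squares.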

MERLIN was run on a Digital Equipment Corporation
System 10 (DEC-10). Parameter estimates were determined on
each cohort's data separately. Central processing unit
(CPU) time required to find least-squares estimates was as
follows: Cohort 1978-79:3 with 75 free parameters to be
estimated required 82 minutes of CPU time; Cohort 1978-79:4
with 47 free parameters required 29 CPU minutes; Cohort
1979-80:2 with 63 free parameters required 36 CPU minutes.

Results

Fit was determined for four models on each cohort's
data separately. Thus, each cohort represented an
independent replication.

Model A was the model proposed above with one free
(RRP) parameter per rater (except for one which was fixed at
500 to anchor the scale) and one free (SAP) parameter per
student. Model A permitted, but did not require, that both
rater leniency and subject achievement contribute to the
fit between the predicted and observed ratings. If there
were no appreciable differences in raters' leniency, the
least-squares values of the RRP's found by MERLIN would all
be near the same value (i.e., 500). Similarly, if there
were no appreciable differences in students' achievement,
the least-squares values for all the SAP's found by MERLIN
would be near the same value. Table 1 provides descriptive
statistics (means and standard deviations) for the estimated
values of Model A's RRP's and SAP's, as well as observed
ratings for each cohort. Means for each of these variables
were quite similar across all three cohorts. Model A was
the most general model considered. Models B and C were
derived by imposing restrictions upon Model A.



Table 1

Means and Standard Deviations (SD) for RRP's, SAP's, and
Ratings Based upon the Full Data Set

                 RRP               SAP          Observed Ratings
Cohort       Mean      SD      Mean      SD       Mean      SD
78-79:3     485.50    38.17   558.03    37.72     73.49    11.75
78-79:4     476.60    37.64   549.48    23.99     74.19     8.16
79-80:2     486.49    21.17   545.52    27.13     72.55     7.77



Model B imposed the restriction that all raters are
equally lenient, i.e., all RRP's equal 500, while allowing
SAP's to vary. This restriction forces the predicted
ratings for the raters of a single student to be the
unweighted mean of the observed ratings of these raters on
this subject. This is the model corresponding to the common
practice of using the mean of the observed ratings as the
best measure of the student's true performance. Note
however that it is accurate only within the context of equal
rater leniency. When contrasted with Model 0 (null
hypothesis), Model B provided a mechanism for determining
how well variation in student performance could account for
observed ratings. Also, statistical contrast of Model B
(achievement) with Model A (both achievement and leniency)
provides a way to determine if rater leniency contributed to
observed ratings, beyond that accounted for by student
achievement. A statistical difference between A and B
indicates a "leniency main effect".

Model C imposed the restriction that all students had
equal achievement, i.e., all SAP's equal 500, while
permitting all the RRP's to vary. When contrasted with
Model 0 (null hypothesis), Model C provided a way to
determine the extent to which variation in the observed
ratings may be accounted for by variation in rater leniency.
Also, when contrasted with Model A (i.e., both achievement
and leniency), Model C (leniency) provides a way to
determine if student achievement makes a significant
contribution to observed ratings beyond that which could be
ascribed to variations in rater leniency. A statistical
difference between Models A and C indicates an "achievement
main effect".

Model 0 embodies the null hypothesis, i.e., a model
which accounts for the observed data as chance (random)
variation from the overall mean rating (across all raters
and students). Models B and C were not "straw-men" intended
to make the proposed model (A) look good. All three
hypothetical models must be used in contrast with each other
and with the null hypothesis to determine the relationships
of interest.

Table 2 presents the results of formal, statistical
contrasts between the proposed model (A), as the full model
(FM), and each of the others (e.g., B, C, and 0) as the
restricted (RM) model (Ward and Jennings, 1973; see also
Sternberg, 1967). All the F-tests resulting from the
contrasts reported in Table 2 produced statistically
significant F's (p<0.01). Table 2 also provides measures of
the fit between each model and the three data bases. The
fit is indicated both by the correlation (r) between the
observed and predicted ratings and by the associated
error sum-of-squares (ESSQ). In all three cohorts, the
proposed model (A) fit better (r=0.82, 0.74, 0.70) than
chance (p<0.01), better than Model B (r=0.72, 0.55, 0.55;
p<0.01), and better than Model C (r=0.44, 0.59, 0.33;
p<0.01). The contrasts between models A, B, and C indicated
that both rater leniency and student achievement made
statistically significant (p<0.01), independent
contributions to the observed ratings in all three cohorts.
In conventional analysis-of-variance terminology, the
results supported the conclusion of a significant (p<0.01)
rater leniency main effect and a significant (p<0.01)
student achievement main effect in each of the three
cohorts.



Table 2

Contrast of Fit of Models A, B, and C to Data from
Each of Three Junior Year Medicine Clerkship Cohorts

       Data Base                Free Parameters (nfp)          Fit              Contrasts
Cohort     nR   nS   nOB   MT    RRP    SAP    Total      r         ESSQ      FM   RM   F ratio*

78-79:3    47   29   136   A      46     29      75     0.8213     4364.65    A    0      4.85
                           B       0     29      29     0.7187    11371.64    A    B      2.13
                           C      47      0      47     0.4436    15713.56    A    C      5.66

78-79:4    31   30   165   A      30     30      60     0.7441     6767.68    A    0      4.53
                           B       0     30      30     0.5456    13692.51    A    B      3.58
                           C      31      0      31     0.5890    12638.68    A    C      3.14

79-80:2    29   35   173   A      28     35      63     0.7000     7219.70    A    0      3.74
                           B       0     35      35     0.5529    12332.16    A    B      2.78
                           C      29      0      29     0.3333    16474.84    A    C      4.15

*For all reported F's, p<0.01. MT=model type; n=number;
R=raters; S=students; OB=observations (ratings);
r=Pearson correlation; ESSQ=error sum of squares; FM=full
model; RM=restricted model; df1=nfpFM-nfpRM;
df2=nOB-nfpFM.
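
The F ratios for the A versus B and A versus C contrasts can
be recomputed from the other columns of Table 2. The sketch
below assumes the usual full-versus-restricted form of the F
ratio, with the degrees of freedom defined in the table
footnote; the function name is illustrative.

```python
def f_ratio(essq_fm, essq_rm, nfp_fm, nfp_rm, n_obs):
    """F ratio for a full model (FM) against a restricted model (RM).

    Assumes the standard form:
    F = ((ESSQ_RM - ESSQ_FM) / df1) / (ESSQ_FM / df2)
    with df1 = nfpFM - nfpRM and df2 = nOB - nfpFM.
    """
    df1 = nfp_fm - nfp_rm
    df2 = n_obs - nfp_fm
    return ((essq_rm - essq_fm) / df1) / (essq_fm / df2)


# Cohort 78-79:3, values taken from Table 2.
print(round(f_ratio(4364.65, 11371.64, 75, 29, 136), 2))  # A vs B -> 2.13
print(round(f_ratio(4364.65, 15713.56, 75, 47, 136), 2))  # A vs C -> 5.66
```

The same function applied to the other two cohorts gives
values that agree with the tabled F ratios to two decimals.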

Because the study was replicated on three independent
data bases and the same results were obtained on each, the
joint probability across all three cohorts for each of the
results cited above was p<0.000001. The probability values
given in the prior paragraph refer to each data base
considered separately. When all were considered together
the smaller value just given should be substituted for the
earlier ones. Partitioning the variance by contrasting
models A, B, and C, we found that in these data about 20% of
the variability in clinical performance ratings could be
attributed to variations in rater leniency. An additional
35% could be attributed to variation in student achievement.
Taken together these results strongly indicated that while a
knowledge of either leniency or achievement provided a
significantly better than chance basis for predicting
ratings, each was a statistically independent factor, and
the best accuracy in prediction was achieved on the basis of
a knowledge of both. These results directly support the
proposed model and thereby indirectly the proposed theory:
performance ratings were a function of both rater leniency
and subject achievement.
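
The text does not say how the variance was partitioned. One
reading that comes close to the quoted figures is to take,
for each cohort, the drop in squared correlation when a
factor is removed from Model A, and then average over
cohorts. The sketch below follows that reading, which is an
assumption and not a documented procedure, using the r
values from Table 2.

```python
# Fit correlations (r) from Table 2, by cohort and model.
fit_r = {
    "78-79:3": {"A": 0.8213, "B": 0.7187, "C": 0.4436},
    "78-79:4": {"A": 0.7441, "B": 0.5456, "C": 0.5890},
    "79-80:2": {"A": 0.7000, "B": 0.5529, "C": 0.3333},
}


def mean_drop(restricted):
    """Mean over cohorts of r(A)^2 minus r(restricted model)^2."""
    drops = [r["A"] ** 2 - r[restricted] ** 2 for r in fit_r.values()]
    return sum(drops) / len(drops)


leniency_share = mean_drop("B")     # Model B removes rater leniency
achievement_share = mean_drop("C")  # Model C removes student achievement
print(round(100 * leniency_share), round(100 * achievement_share))
```

Under this reading the shares come out near 20% and 35%.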

As some raters rated students in more than one cohort,
it was possible to calculate a "test-retest" reliability
coefficient for the rater reference points (RRP) of these
raters. Table 3 provides the reliability coefficients (r)
determined on pairs of RRP's for each rater. The number (n)
of raters who rated students in two cohorts is indicated in
parentheses under the corresponding r value. The
probability (p) of the observed correlation arising by
chance is also given. All these reliabilities are positive
but below r=0.30. Although no single one of these r's
departed from a value of r=0.00 to a statistically
significant degree (i.e., individual probabilities were
p>0.15), at least two of these r's were statistically
independent. From their joint, independent occurrence it
was found that the set of r values differed significantly
(p<0.04) from an r=0.00. This very important result
provides directly validating evidence for the theoretical
construct of leniency and indirect validation for the
construct of achievement. For these raters, we found that
while their RRP's were labile or difficult to measure with
precision, their RRP's corresponded to some feature of their
rating behavior that persisted over at least a six month
period of time.

Table 3

Correlations between RRP's for Same Instructors
Across Independent Cohorts of Students
All Available Data Used to Estimate RRP's

                   78-79:4     79-80:2
78-79:3     r      0.2796      0.2883
            n      (15)        (13)
            p      0.1560      0.1700

78-79:4     r                  0.2483
            n                  (9)
            p                  0.2600



The mean observed rating on each student was moderately
well correlated (r<0.95) with the rating that the proposed
model predicted a rater of mean leniency would assign.
Assuming (on the strength of the evidence thus far reported)
that the proposed model was valid, this result indicates
that the leniency of the various sets of raters who rated
these students was moderately representative of the whole
pool of 75 different raters. This would be expected as
assignment of students to raters was random. But, random
assignment could produce highly different subsets of raters.
Apart from the model under investigation here, there was no
other technique for determining the representativeness of
the rater subsets. The results only suggest that the rater
subsets were moderately representative.

To further test the proposed model, a cross-validation
of model predictions against an independent criterion was
conducted. A restricted data set was created from the full
data set. The full data set contained all the observed
ratings on the three cohorts used in the analyses reported
above. The restricted data set was formed by setting aside
(i.e., "saving") one randomly chosen rating per student
(with the constraint that the remaining restricted data set
contained no rater who rated fewer than two students nor a
student rated by fewer than two raters). Parameters (RRP's
and SAP's) were then estimated on each cohort's restricted
data set separately. Descriptive statistics (means and
standard deviations) for the observed ratings, and RRP's and
SAP's estimated for Model A from the restricted data set are
given in Table 4. When compared with the values obtained on
the full data set (Table 1), the reduction of one
observation per student had no significant impact on the
means.
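
The construction of the restricted data set can be sketched
as below. The source does not say how the constraint was
enforced; the sketch simply redraws the saved ratings until
the constraint holds, which is an assumption. A rater left
with no ratings at all is treated as no longer contained in
the restricted set. The function name and demonstration
data are hypothetical.

```python
import random


def hold_out_one_per_student(ratings, seed=0, max_tries=1000):
    """Split {(student, rater): rating} into (restricted, saved).

    One randomly chosen rating per student is saved. A draw is accepted
    only if every student keeps at least two ratings and every rater
    still present in the restricted set has at least two ratings.
    """
    rng = random.Random(seed)
    students = sorted({s for s, _ in ratings})
    for _ in range(max_tries):
        saved = {}
        for s in students:
            cells = sorted(k for k in ratings if k[0] == s)
            key = rng.choice(cells)
            saved[key] = ratings[key]
        restricted = {k: v for k, v in ratings.items() if k not in saved}
        per_student, per_rater = {}, {}
        for s, r in restricted:
            per_student[s] = per_student.get(s, 0) + 1
            per_rater[r] = per_rater.get(r, 0) + 1
        ok_students = all(per_student.get(s, 0) >= 2 for s in students)
        ok_raters = all(n >= 2 for n in per_rater.values())
        if ok_students and ok_raters:
            return restricted, saved
    raise ValueError("no split satisfying the constraints was found")


# Hypothetical ratings: 4 students, each rated by the same 3 raters.
demo = {(s, r): 70.0 for s in ("S1", "S2", "S3", "S4")
        for r in ("R1", "R2", "R3")}
restricted_set, saved_set = hold_out_one_per_student(demo)
print(len(restricted_set), len(saved_set))
```

The saved ratings then serve as the independent criterion
against which predictions from the restricted set are
compared.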



Table 4

Means and Standard Deviations (SD) for RRP's, SAP's, and
Ratings Based upon the Restricted Data Set

                 RRP               SAP          Observed Ratings
Cohort       Mean      SD      Mean      SD       Mean      SD
78-79:3     490.57    45.65   566.25    45.70     74.44    12.70
78-79:4     466.78    74.11   547.52    24.48     74.73     8.54
79-80:2     487.84    19.91   549.90    37.94     72.29     8.91



The saved ratings were then correlated with the
corresponding elements in two different sets of predicted
ratings: (a) those given by the proposed model (when its
parameters had been estimated from the restricted data set);
and, (b) those given by the model underlying the most common
rating practice, i.e., Model B, which is equivalent to the
mean of the ratings each student received in the restricted
data set. In each case the saved ratings were independent
of the predictions with which they were correlated.

This procedure could put the proposed model at a
substantial disadvantage when contrasted with the alternate
model (B). This arises from the reduction in data available
to estimate parameters. By consulting Table 2, it can be
deduced that in the full data set the ratio of observations
to free parameters (to be estimated for Model A) was 1.8,
2.7, and 2.7 respectively in the three cohorts. In the
restricted data set, these ratios declined to 1.4, 2.2, and
2.2. In cohort 1978-79:3 the ratio fell from an already
marginal 1.8 observations per parameter in the full data set
to a very doubtful 1.4 in the restricted data set. A low
ratio could place the proposed model at a disadvantage
because it had more parameters to be estimated. Less data
per parameter would reduce the accuracy of the parameter
estimates and thus the accuracy of the model's predictions.
The alternate model, having only about half as many
parameters to estimate, had an advantage in obtaining more
accurate estimates of its parameters (i.e., one mean per
student).
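
The ratios of observations to free parameters can be
recomputed from Table 2. For the restricted data set the
sketch assumes that exactly one rating per student was
removed and that the number of free parameters for Model A
was unchanged; both are assumptions made for illustration,
as is the function name.

```python
# From Table 2: cohort -> (observations, Model A free parameters, students).
table2 = {
    "78-79:3": (136, 75, 29),
    "78-79:4": (165, 60, 30),
    "79-80:2": (173, 63, 35),
}


def obs_per_parameter(cohort):
    """Return (full, restricted) observations per free parameter."""
    n_obs, nfp, n_students = table2[cohort]
    full = n_obs / nfp
    restricted = (n_obs - n_students) / nfp  # one rating saved per student
    return full, restricted


for cohort in table2:
    full, restricted = obs_per_parameter(cohort)
    print(cohort, round(full, 2), round(restricted, 2))
```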

Table 5 reports the results of correlating an
independent rating of each student with the prediction of
the proposed model (A) and the prediction implicit in the
common practice of taking the unweighted mean of the
observed ratings (Model B) as the best available measure of
performance. In two of the three cohorts the results appear
to favor the proposed model, but in cohort 1978-79:3, Model
B seems to be superior to the proposed model. This means
that in two of the three cohorts predictions based upon a
knowledge of both rater leniency and student performance
(i.e., Model A) appeared superior to a knowledge of student
performance alone (Model B). In cohort 1978-79:3, the
prediction of Model A was not only less accurate (i.e., less
well correlated with the criterion), the observed
correlation (r=0.26) for Model A was not significantly
different from r=0.0. Considering that Model B was a
restricted case of Model A, Model A should do no worse than
Model B.



Table 5

Correlations of Prediction of Models A and B with an
Independent Rating on Each Subject

Cohort         A         B
78-79:3     0.2555    0.5020
78-79:4     0.6699    0.5531
79-80:2     0.4027    0.2022
Mean 1      0.5136    0.4465
Mean 2      0.6128    0.4055

For cohort 1978-79:3, the data indicate that very poor
estimates for Model A's parameters were obtained from the
restricted data set. The result of Model A fitting worse
than Model B was directly attributable to the lack of
sufficient data in the restricted data set for
simultaneously estimating SAP's and RRP's. This "negative
finding" was serendipitously suggestive of a useful rule of
thumb. Anytime the correlation between the proposed model's
predictions (when based upon the parameter estimation
procedures in MERLIN) and independent criterion ratings
fails to at least equal the correlation between the
criterion and each subject's mean observed rating (i.e., the
prediction of Model B), then there are insufficient data
available to make useful estimates of the parameters of
Model A. In the case at issue, this interpretation was
corroborated by an analysis of correlations between the
values estimated for Model A's parameters (RRP's and SAP's)
based on the full data set and the estimates for the same
parameters based on the restricted data set. The results of
these analyses are reported in Table 6.
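
The rule of thumb above can be sketched as a simple check. This is
only an illustration: the function name data_sufficient and the
dictionary layout are hypothetical and are not part of MERLIN; the
correlations are the Table 5 values.

```python
# Correlations of each model's predictions with the independent
# criterion rating, by cohort (values taken from Table 5).
TABLE_5 = {
    "78-79:3": {"A": 0.2555, "B": 0.5020},
    "78-79:4": {"A": 0.6699, "B": 0.5531},
    "79-80:2": {"A": 0.4027, "B": 0.2022},
}


def data_sufficient(r_model_a: float, r_model_b: float) -> bool:
    """Rule of thumb: Model A should predict the criterion at least
    as well as the mean observed rating (Model B); otherwise the data
    are judged insufficient for estimating Model A's parameters."""
    return r_model_a >= r_model_b


# Hypothetical helper output: one flag per cohort.
flags = {cohort: data_sufficient(r["A"], r["B"])
         for cohort, r in TABLE_5.items()}
print(flags)
```

Under this check only cohort 1978-79:3 is flagged as having
insufficient data.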



Table 6

Correlations between Parameters Estimated from the Full
Data Set with Those Estimated from the Restricted Data Set

Cohort       RRP       SAP       Both      RSQ

78-79:3     0.83G0    0.7991    0.8178    0.6688
78-79:4     0.9173    0.9691    0.9508    0.9040
79-80:2     0.8676    0.9625    0.9329    0.8703



These correlations would be high if the parameter
estimates were stable. The correlations for RRP's and SAP's
separately and combined indicated that there was good
stability for the parameter estimates in cohorts 1978-79:4
and 1979-80:2. Taking the square of the correlation (RSQ)
between the two conditions (i.e., full and restricted data
sets) as a measure of common variance, the stability of
cohort 1978-79:3's parameter estimates was clearly poor
(RSQ=0.67). Deleting one observation per student produced
substantially different parameter estimates. Better
estimates could not be had from less data; therefore, the
estimates from the restricted data set must have been
substantially worse than those from the full data set. It is
important to emphasize the extreme conditions under which
the parameter estimation procedure failed. Complete data on
the cohort would have contained 47 raters x 29 students =
1363 observed ratings. In the reduced data set there were
107 observations. In other words, 7.85% of the possible
data were present and 92.15% of the data were missing from
the observed data table input to MERLIN. In the other two
cohorts, the respective data tables were 17.74% and 17.04%
complete.
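
The arithmetic in this paragraph can be checked with a short
sketch. The function names are illustrative only; the inputs are the
figures quoted in the text and the "Both" correlation for cohort
1978-79:3 from Table 6.

```python
def percent_complete(n_observed: int, n_raters: int, n_subjects: int) -> float:
    """Share of the full rater-by-subject table that was actually observed."""
    return 100.0 * n_observed / (n_raters * n_subjects)


def common_variance(r: float) -> float:
    """RSQ: squared correlation between full and restricted estimates."""
    return r * r


pct = percent_complete(107, 47, 29)   # cohort 1978-79:3, restricted set
rsq = common_variance(0.8178)         # "Both" correlation, cohort 1978-79:3

print(f"possible ratings: {47 * 29}")
print(f"complete: {pct:.2f}%  missing: {100 - pct:.2f}%")
print(f"RSQ: {rsq:.4f}")
```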

With the clear evidence that it was the parameter
estimation process rather than the proposed model that
failed, and that the failure was due to lack of sufficient
data to make useful estimates of the proposed model's
parameters, we reconsidered the results reported in Table 5.

Means 1 and 2 were computed using the weighted r to z
mean correlation procedure recommended by McNemar (1966, p.
139). The mean correlation between Model A and an
independent criterion (i.e., the saved ratings) across all
three cohorts (Mean 1) was higher than that obtained by
Model B, but not significantly higher (p>0.15). However,
ample evidence had been found which required the exclusion
of the 1978-79:3 data from this comparison. Therefore, Mean
2 was calculated only upon the results for cohorts 1978-79:4
and 1979-80:2. This resulted in r=0.61 for the proposed
model, while the mean correlation between the criterion and
Model B predictions was r=0.41. Each of these correlations
was significantly greater than r=0.0 (p<0.004). Further,
the proposed model predicted the criterion significantly
better (z=2.62, p<0.004) than did the alternative model.
This result directly validates the theoretical constructs of
both rater leniency and subject achievement.
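
For readers unfamiliar with the r to z averaging named above, the
following is a generic sketch of a weighted mean correlation using
Fisher's transform. The weights of n - 3 are the conventional
choice and are an assumption here, since this section does not
state the weights actually used. The example inputs are
hypothetical, and the sketch is not intended to reproduce Means 1
and 2 in Table 5.

```python
import math


def weighted_mean_r(rs, ns):
    """Average correlations through Fisher's r-to-z transform.

    Each z = atanh(r) is weighted by n - 3 (an assumed, conventional
    weighting), and the weighted mean z is transformed back to r.
    """
    weights = [n - 3 for n in ns]
    z_bar = sum(w * math.atanh(r) for w, r in zip(weights, rs)) / sum(weights)
    return math.tanh(z_bar)


# Hypothetical correlations and sample sizes, for illustration only.
example = weighted_mean_r([0.30, 0.50], [40, 60])
print(round(example, 4))
```

Averaging in the z metric rather than averaging the raw
correlations avoids the bias that arises because the sampling
distribution of r is skewed.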

Model A's predictions correlated higher with the
independent criterion ratings, r=0.61, because Model A's
predictions were more nearly valid. The raw ratings
contained two components: subject achievement and rater
leniency. As measures of true subject performance, the raw
ratings were contaminated with rater leniency and were
therefore less valid and reliable measures of true subject
performance. The reliability of r=0.50 for raw ratings
reported in earlier work (Cason and Cason, 1979) was an
overestimate because it did not take the leniency effect
into account. The best available estimate for the
reliability of raw ratings as measures of performance alone
was the mean correlation between Model B and the criterion
ratings in the last two cohorts (Mean 2): r=0.41. Our
model attained higher correlations with the criterion
because it explicitly used both rater leniency and
achievement data to make its predictions. The model
depicted the data more validly than could the mean of raw
ratings in incomplete data sets. Therefore, the best
available measure of student performance or student
achievement was the rating that our model predicted a rater
of average leniency would assign a given subject (or, its
equivalent on the latent scale, this subject's SAP).

Applying our model, the reliability of a single rating
as a measure of true performance was r=0.61. Leniency
effects had been removed; therefore, Spearman-Brown's
formula was appropriate to conservatively estimate the
reliability of a rating based upon several independent
raters. Specifically, our model's predicted mean rating for
each subject based on 5 ratings had an estimated reliability
of r=0.89. By the same logic, the reliability of the mean
of 5 raw ratings as a measure of true performance was
calculated taking r=0.41 as the reliability of a single raw
rating. Applying Spearman-Brown's formula, this gave r=0.78
for the reliability of the mean of 5 observed ratings as a
measure of student true performance. Because validity
cannot exceed reliability, these results clearly indicated
our model could produce substantially more nearly valid
measures of student true performance from an incomplete data
table than could the mean of observed ratings on each
student.
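
The Spearman-Brown step can be written out as a short sketch. The
function name is illustrative; the formula is the standard one for
the reliability of the mean of k parallel ratings, and the inputs
are the single-rating reliabilities given in the text.

```python
def spearman_brown(r_single: float, k: int) -> float:
    """Reliability of the mean of k ratings, each with reliability r_single."""
    return k * r_single / (1.0 + (k - 1) * r_single)


r_model = spearman_brown(0.61, 5)  # model-based ratings, leniency removed
r_raw = spearman_brown(0.41, 5)    # raw observed ratings

print(f"model-based, 5 ratings: {r_model:.2f}")  # 3.05 / 3.44
print(f"raw mean, 5 ratings:    {r_raw:.2f}")    # 2.05 / 2.64
```

The two results, about 0.89 and 0.78, are the values compared in the
paragraph above.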

Conclusions and Implications 

All the a priori objectives of the research were
attained. With respect to clinical performance rating data
sets of a type which are common to health professions
education (i.e., dirty and incomplete), the proposed model
was empirically demonstrated to have: (a) closely fit the
data (p<0.000001); (b) clarified and quantified the separate
contributions of rater leniency and subject achievement
(e.g., 20% and 35% of variance accounted for respectively in
these data, empirical cross-validation of both constructs,
and so forth); and (c) provided a usable mechanism for
generating more reliable and valid ratings-based measures of
clinical performance, as indicated by the reliability of
r=0.89 (based on 5 independent ratings) attained from
application of the proposed model as compared to r=0.78
attained for the most commonly used current alternative,
i.e., the mean of the 5 observed ratings.

The results clearly demonstrated the superiority of the
proposed model when data sets were incomplete and subjects
were rated by unrepresentative subsets of raters. In
addition, an empirical method for judging the adequacy of
the data for the application of the model was demonstrated.
When the proposed model failed to provide fit with the data
at least as good as the mean of each subject's observed
ratings, the data set was insufficient to provide adequate
estimates of the proposed model's parameters. Nevertheless,
the proposed model provided improved measures of performance
when the data set was as little as 17% complete.

The conditions of the tests contrasting the proposed
model (A) with the mean of the observed ratings were biased
against the proposed model. Assignment of students had been
random, so variation of average leniency in rater subsets
would tend to be small. This tended to reduce the rater
main effect in these data. In settings where non-random
assignment occurs, larger discrepancies in mean rater
leniency could easily occur. In such settings, the power of
the proposed model in producing more valid measures would be
even more pronounced. Assuming the proposed model was
valid, Table 7 provides a "worst case" example of the
potential impact of rater leniency upon ratings received by
students. This example was based on the extreme (lenient
and stringent) raters and extreme (low and high achieving)
students in cohort 1979-80:2. The top row depicts the
ratings that the most stringent rater would assign; the
bottom row, the most lenient. The left column gives the
corresponding rater reference points for the two raters.
The middle column gives the expected rating for the low
achieving student; the rightmost column, the expected
rating for the high achieving student. Both raters see
the high achieving student much the same; there is only a
10% difference in ratings. But the low achieving student
is predicted to receive drastically different ratings:
there is a 30% difference in ratings. Predictions rather
than observed discrepancies were used in the illustration
because it was the model that was validated in this
research. Whether discrepancies as large as this occurred
in these data was a chance matter. The model's predictions
were a better general indicator of the possible magnitude
than coincidental data because the model captured a set of
relationships in whole data sets.



Table 7

Maximum Effect of Rater Leniency on Predicted
Student Ratings in Cohort 79-80:2

                            Low Student    High Student
Rater (RRP)                 (SAP 497.9)    (SAP 653.7)

Stringent rater (534.9)       35.59%         88.27%
Lenient rater (452.1)         67.65%         97.81%
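
The predictions in Table 7 can be illustrated with a minimal sketch
of a normal-ogive rater characteristic curve applied to the distance
between a subject's SAP and a rater's RRP. The scale unit of 100 is
an assumption: it is not stated in this section and was chosen
because it comes close to the tabled values. The function name is
hypothetical and this is not the MERLIN code.

```python
import math

# Assumed spread of the ogive on the latent scale (not stated in the
# text; chosen because it approximately matches Table 7).
SCALE_UNIT = 100.0


def predicted_rating(sap: float, rrp: float, scale: float = SCALE_UNIT) -> float:
    """Predicted percent rating: normal ogive of (SAP - RRP) / scale."""
    z = (sap - rrp) / scale
    return 100.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


LOW_SAP, HIGH_SAP = 497.9, 653.7
for label, rrp in [("Stringent", 534.9), ("Lenient", 452.1)]:
    low = predicted_rating(LOW_SAP, rrp)
    high = predicted_rating(HIGH_SAP, rrp)
    print(f"{label} rater ({rrp}): low {low:.2f}%  high {high:.2f}%")
```

Because the ogive flattens toward its upper end, the same difference
in rater reference points moves the low student's predicted rating
far more than the high student's.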



In spite of the consistency, strength, and coherence of
the results supporting the proposed model found in these
data, these data were limited. Only one setting, an
internal medicine clerkship, was represented. Only one
rating inventory was used. Still, 75 different raters were
involved and 94 different students were rated. It would not
be prudent to conclude that the proposed model will fit
every conceivable performance rating setting. Neither would
it be reasonable to ignore the strength of the results from
these limited data. There are too many commonalities
between these data and many others not to expect that this
model may prove very useful in a wide variety of settings
and contexts.

Extrapolating optimistically from these early,
promising results, a number of useful possibilities occur to
us. Our model might meet Meskauskas and Norcini's (1980)
requirements for a methodology for "handicapping" judges in
both standards setting and performance assessment procedures
better than do Stanley's (1961) methods. Our results
suggest that in some settings rater leniency may not be
sufficiently stable to use Stanley's methods. However,
because our model can be applied to incomplete data sets, it
provides a means of "adjusting" judges' ratings on the basis
of their current behavior rather than on their past ratings.

An intriguing possibility is the application of our
model to the problem of assessing the test items in a large
item bank. Some test item banks now have thousands of items
in them. But these items are not equally relevant to the
objectives of specific training programs which may use these
test item banks. Our model would permit a more uniform
standard to be applied in judging the difficulty or
relevance of items in the item bank while reducing the
extent to which redundant judgements were required. For
example, our model might permit judges to consider only
slightly overlapping subsets of items while applying
Angoff's (1971) or a similar standards setting method. The
judges' judgements could be calibrated through the common
items that they judged. This would permit a small number of
judges (e.g., the faculty in a department) to evaluate a
larger item bank without either taking years or imposing an
unrealistic burden on the individuals.

Our model provides a technique whereby it would be
possible to "track" the rating performance of individual
raters and provide them with feedback on how their ratings
compared with other raters in settings where not all raters
rate all subjects. This might even be useful in settings
where raters had been trained to a very high level of skill
so that only a few raters would rate each subject. So long
as there were adequate overlap in ratings, the model would
provide a way of monitoring raters that was non-intrusive
and inexpensive, since it requires only their routine rating
data.

There are at least two general ways in which our model
may prove to be of research interest. First, the model
itself, in so far as it is a simplification of a somewhat
more elaborate theory, deserves investigation. Perhaps
incorporation of differential rater sensitivity, explicit
representation of problem or situation difficulty, or other
elaborations of the proposed model would lead to further
improvements in ratings-based measures of complex human
performance. However, such elaborations would involve
adding parameters, and this would require more nearly
complete data sets if useful estimates of the parameters
were to be achievable. In spite of the success of the
simplified model in fitting and explaining the relationships
in these data, the model is a gross simplification of even
the rudimentary performance rating theory that we have
proposed.

Second, the proposed model may be useful as an analytic
method in research involving complex human performance as
either a criterion or predictor variable. With notable
exceptions such as Sheehan, Husted, Candee, Cook, and
Bargen's (1980) report, prior investigations of the
relationships between complex performance variables (such as
clinical performance) and variables measured by more
reliable methods (such as objectively scored aptitude and
achievement tests) have found only very modest relationships
or none at all. This may have arisen in part because of the
relatively low reliability and/or validity of the available
ratings-based measures of complex performance. The proposed
model may have a substantial contribution to make to these
investigations by providing a way to obtain more nearly valid
and highly reliable measures of complex performance than
have been available in the past. This prospect is
especially exciting for those areas of performance where
there are already large but dirty and incomplete data sets
available and/or those areas which, for practical reasons,
may be unable to concurrently produce both clean and
complete data sets regardless of the resources available.

While it is desirable that the judgements of individual
judges be made as reliable and valid as possible, there
will almost certainly always be more assessment programs
that generate incomplete, dirty data sets than complete,
clean ones. The model we have presented here shows real
promise for improving the quality of the assessment
information that may be extracted under these less than
ideal and unfortunately common circumstances.







References 



Anderson, D.O., Baker, H.H., Laguna, J.E., and Laguna, J.F.
Applying the Rasch model to improve health science clerkship
evaluations. Presented at the Annual Meeting of the Rocky
Mountain Educational Research Association, Las Cruces, N.M.,
1980.

Angoff, W.H. Scales, norms and equivalent scores. In R.L.
Thorndike (Ed.) Educational measurement (2nd ed.). Washington,
D.C.: American Council on Education, 1971.

Baker, F.B. Advances in item analysis. Review of Educational
Research, 1977, 47, 151-178.

Cason, G.J. MERLIN: A FORTRAN IV program for finding
least-squares estimates of rater reference points, subject
achievement points, and goodness-of-fit for Cason and Cason's
model of performance rating. Copyright 1980 by Gerald J. Cason.
(Available from author.)

Cason, G.J., and Cason, C.L. Rating students' clinical
performance: Interim report number 2. Presented at the Annual
Meeting of the Mid-South Educational Research Association, Little
Rock, Arkansas, 1979.

Chandler, J.P. STEPIT: A FORTRAN II subroutine for finding
local minima of real functions. Copyright by J.P. Chandler.
(Available from Quantum Chemistry Program Exchange, Indiana
University: Bloomington, Indiana.)

Cromier, G. A study of the applicability of a truly objective
model in medical education. In Proceedings of the Sixteenth
Annual Conference on Research in Medical Education. Washington,
D.C.: American Association of Medical Colleges, 1977, 123-128.

Davidge, A.M., Davis, W.K., and Hull, A.L. A system for the
evaluation of medical students' clinical competence. Journal of
Medical Education, 1980, 55, 65-67.

Davis, W.K., Hull, A.L., Davidge, A.M., and Dielman, T.E.
Variables influencing ratings of medical student's clinical
performance. Presented at the Annual Meeting of the American
Educational Research Association, San Francisco, 1979.

Dielman, T.E., Hull, A.L., and Davis, W.K. Psychometric
properties of clinical performance rating. Evaluation and the
Health Professions, 1980, 3(1), 103-117.

Ebel, R.L. Estimation of the reliability of ratings.
Psychometrika, 1951, 16, 407-424.






Hambleton, R.K., Swaminathan, H., Cook, L.L., Eignor, D.R., and
Gifford, J.A. Developments in latent trait theory: Models,
technical issues, and applications. Review of Educational
Research, 1978, 48, 467-510.

Harasym, P.H. A comparison of the Nedelsky and modified Angoff
standard-setting procedure on evaluation outcome. In Proceedings
of the Nineteenth Annual Conference on Research in Medical
Education. Washington, D.C.: American Association of Medical
Colleges, 1980, 3-8.

Hughes, F.P. The Rasch model applied to the equating of several
examination forms. Paper presented at the Annual Meeting of the
American Educational Research Association, San Francisco, 1979.

Kreines, D.C., and Mead, R.J. Equating tests with the Rasch
model. Paper presented at the Annual Meeting of the American
Educational Research Association, San Francisco, 1979.

Landy, F.J., and Barnes, J. Scaling behavioral anchors. Applied
Psychological Measurement, 1978, 3(2), 193-200.

Lord, F.M. A theory of test scores. Psychometric Monographs,
1952, No. 7.

Lord, F.M. An application of confidence intervals and maximum
likelihood to the estimation of an examinee's ability.
Psychometrika, 1953, 18, 57-75.

McNemar, Q. Psychological statistics (3rd Ed.). New York:
Wiley, 1966.

Mead, R.J., Wright, B.D., and Bell, S.R. BICAL: Version 3.
Computer program to perform Rasch item analysis. Chicago:
University of Chicago, 1979.

Meskauskas, J.A., and Norcini, J.J. Standard-setting in written
and interactive (oral) specialty certification examinations:
Issues, models, methods, challenges. Evaluation and the Health
Professions, 1980, 3(3), 321-360.

Nedelsky, L. Absolute grading standards for objective tests.
Educational and Psychological Measurement, 1954, 14, 3-19.

Nunnally, J.C. Psychometric theory. New York: McGraw-Hill,
1967.

O'Donohue, W.J., and Wergin, J.F. Evaluation of medical students
during a clinical clerkship in internal medicine. Journal of
Medical Education, 1978, 53, 55-58.

Pierleoni, R.G., Clark, G.H., and Dudding, B.A. A comparison of
faculty, resident, and nurse practitioner ratings of ambulatory
pediatric students. Presented at the Annual Meeting of the
American Educational Research Association, San Francisco, 1979.






Printen, K.J., Chappell, W., and Whitney, D.R. Clinical
performance evaluation of junior medical students. Journal of
Medical Education, 1973, 48, 343-348.

Rasch, G. An item analysis which takes individual differences
into account. British Journal of Mathematical and Statistical
Psychology, 1966, 19, 49-57.

Remmers, H.H., Shock, N.W., and Kelly, E.L. An empirical study
of the validity of the Spearman-Brown formula as applied to the
Purdue Rating Scale. Journal of Educational Psychology, 1927,
18, 187-195.

Schumaker, C.F., et al. Applying the Rasch model to equate
examinations in the field of medicine. Presented at the Annual
Meeting of the American Educational Research Association, San
Francisco, 1979.

Sheehan, J.T., Husted, S.C.R., Candee, D., Cook, C.D., and
Bargen, M. Moral judgement as a predictor of clinical
performance. Evaluation and the Health Professions, 1980, 3(4),
393-404.

Smith, H.A., and Kifer, E. Student evaluation in an externship
utilizing the Rasch model for test calibration. American Journal
of Pharmaceutical Education, 1980, 44, 6-11.

Smith, P.C., and Kendall, L.M. Retranslation of expectations: An
approach to the construction of unambiguous anchors for rating
scales. Journal of Applied Psychology, 1963, 47, 149-155.

Snedecor, G.W. Statistical methods (4th Ed.). Ames, Iowa:
Iowa State College Press, 1946.

Stanley, J.C. Analysis of unreplicated three-way classifications
with applications to rater bias and trait independence.
Psychometrika, 1961, 25(2), 203-219.

Sternberg, S. Stochastic learning theory. In R.D. Luce, R.R.
Bush and E. Galanter (Eds.) Handbook of Mathematical Psychology,
Volume II. New York: Wiley, 1967.

Stillman, P.L. Arizona Clinical Interview Medical Rating Scale.
Medical Teacher, 1980, 2(5), 248-251.

Stillman, P.L., Brown, D.R., Redfield, D.L., and Sabers, D.L.
Construct validation of the Arizona Clinical Interview Rating
Scale. Educational and Psychological Measurement, 1977, 37,
1031-1038.

Symonds, P.M. Diagnosing personality and conduct. New York:
Century, 1931.

Ward, J., and Jennings, E. Introduction to linear models.
Englewood Cliffs, N.J.: Prentice-Hall, 1973.






Wright, B.D. Sample-free test calibration and person
measurement. In Proceedings of the 1967 Invitational Conference
on Testing Problems. Princeton, N.J.: Educational Testing
Service, 1968.

Wright, B.D., and Stone, M.H. Best test design. Chicago: MESA
Press, 1979.



Acknowledgements

We gratefully acknowledge the encouragement, assistance, and
co-operation of: George Ackerman, Harry Ackerman, Jerry
Blackburn, Roger Bone, Tom Bruce, John Delk, Ross Dykman, Ron
Hale, Lisa Hale, Peter Kohler, Tom Lewis, Tom Monson, Jim
Phillips, Bob Shannon, Lois Tipton, and Ture Schoultz.

This research was supported in part by the taxpayers of the
United States through grant No. 90AL0005-01 from the Department
of Health and Human Services.



[Appendix: UAMS clinical performance rating form. The scanned form
is not legibly reproduced; only the recoverable content is given
here.]

Rated items (obverse): understanding facts, rules; applying facts,
rules, etc.; problem solving, analysis, synthesis, evaluation;
relations with peers (Jr. Med. students), patients, faculty,
residents, and clinical team (RNs, techs, etc.); implicit
responsibilities; being corrected; conducting history; conducting
physical exam; recording history; recording physical; requesting
studies/tests; requesting consults; interpreting history, physical,
studies/tests, and consult results; synthesizing problem /
formulating diagnosis; selecting/formulating treatment; manual
skills; executing procedures; follow-up, evaluation, revision of
treatment regimen.

RATER'S COMMENTS: Raters are asked to enter any comments relevant to
the evaluation of the student, in particular on specific strengths
and weaknesses and documentation for the ratings assigned. Comments
are only a supplement, not a substitute, for ratings.

PROVISIONAL OVERALL GRADE: In marking item 34 on the obverse, use
the definitions for 5, 4, 3, 2, and 1 given below. For all other
items, use the definitions provided above the items on the obverse
of this form.

5 = A = OUTSTANDING overall performance for Med. School Jr.
4 = B = ABOVE AVERAGE overall performance for Med. School Jr.
3 = C = AVERAGE overall performance for Med. School Jr.
2 = D = BELOW AVERAGE overall performance for Med. School Jr.
1 = F = UNSATISFACTORY overall performance for Med. School Jr.

Cason and Cason: Forformance Rating 



Page 25 



to free paraiaeters (to be estimated for Model A) was 1^8, 
2.1^ and 2*7 respectively in the three cohorts. In the 
restricted data set/ these ratios declined to 1.4/ 2.2/ and 
2«2<» In cohort 1978-79:3 t5ie ratio fell from an already 
marginal 1.8 observations per parameter in the full data set 
a very doubtful 1*4 in the restricted data set. A low 
ratio could place the proposed raodel at a disadvantage 
because it had more parameters to be estimated* Less data 
per patamtcr would reduce the accuracy of the parameter 
estimates and thus the accuracy of the model's predictions. 
The alternate model having only about half as aany 
parameters to estimate had an advantage in obtaining more 
accurate estimates of its parameters (i.e./ one mean per 
student)- 

Table 5 reports the results of correlating an 
independent rating of each student with the prediction of 
the proposed model (A) and the prediction Implicit in the 
common practice of taking the unweighted mean of the 
observed ratings (Model B) as the best available measure of 
performance. In two of the three cohorts the results appear 
to favor the proposed model/ but in cohort 1978-79:3/ Model 
B seems to be superior to the proposed siodel. This means 
that in two of the three cohorts predictions based upon a 
knowledge of both rater leniency and student performance 
(i.e./ ^^odel A) appeared superior to a knowledge of student 
perf orniarice alone (Model B). In cohort 197 8-79: 3^ the 
prediction of Model A was not only less accurate (i.e./ less 
well correlated with the criterion)/ the observed 
correlation (r=0o26) for Model A was not significantly 
diffetent frcR r=OoO* Considering that J«odel B was a 
restricted case of Model A> Model A should do no worse than 
Model B. 



Table 5 

Correlations of Prediction of i^odels A and B with an 
independent Rating on Each Subject 



8 

0.5020 
0.5531 
0.2022 
0.4465 
0o4055 



ERIC 



Cohort A 

78-79:3 0.2555 

78- 79;4 0-6699 

79- 80J2 0.4027 
Mean 1 0.5136 
Mean 2 0.6128 
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For cohort 1978-79:3, the data indicate that very poor 
estimates for Wodel A^'s parameters were obtained from the 
restricted data set. The result ot Model A fitting worse 
than Model E was directly attributable to the lack of 
sufficient data in the restricted data set for 
simultaneously estiioatlng SAP's and RRP*s. This "negative 
finding" was serendipitously suggestive of a useful rule of 
thumb. Anytime the correlation between the proposed model's 
predictions (when based upon the parameter estimation 
procedures in MERLIN) and independent cr iter Ion ratings 
fails to at least equal the correlation between the 
criterion and each subject^'s mean observed rating (l^e./^ tiie 
prediction of Model B)/ then there are insufficient data 
avaiiaJble to make useful estimates of the parameters of the 
Model A. In t^e case at issue, this Interpretation was 
corroborated by an analysis of correlations between the 
values estimated tor Model A'^s parameters (i?RP*s and SAP's) 
based on the full data set with estimates for the same 
parameters based on the restricted data set* The results of 
these analyses are reported in Table 6. 



Table 6 



Correlations between Parameters Estimated from the Full 
Data Set with Those Estimated from the Restricted Data Set 



Cohort 
78-7953 

78- 79:4 

79- 80:2 



RRF 
0.83G0 
0.9173 
0.8676 



SAP 
0*7991 
0.9691 
0. 9625 



Both 
0.8178 
0.9508 
0.9329 



RSQ 
0.6688 
0.9040 
0.8703 



These correlations would be high if the parameter 
estimates were stable. The correlations for RRP's and SAP's 
separately ard combined Ind icated that there was good 
stability for the parameter estimates in cohorts 197 8-79 :4 
and 1979-80:2. Taking the sguare. of the correlation (fiSG) 
between the two conditions (i.e., full and restricted data 
sets) as a measure of common variance/ the stability of 
cohort 1978-79:3's parameter estimates was clearly poor 
(RSQ=0.67). Deleting one observation per student produced 
substantially different parameter estimates. Better 
estimates could not be had from less data> therefore, the 
estimates from the restricted data set must have been 
substantially worse than from the full data set. It is 
important to emphasize the extreme conditions under which 
the parameter estimation procedure failed. Complete data on 
the cohort would have contained: 47 raters x 29 students = 
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1363 observed ratings* In the reduced data set there were 
107 observationsa In other yords, 7*85% of the possible 
data were present and 92*15% of the data were missing from 
the observed data table input to NERLIN* In the other two 
cohorts, the respective data tables t?ere 17.74% and 17.04% 
complete. 

Kith the clear evidence that it was the parameter 
estimation process rather than the proposed model that 
failed and that the failure was due to lack of sufficient 
data to make useful estimates of the proposed model's 
parameters, we reconsidered the results reported in Table 5. 

Means 1 and 2 were computed using the wei</hted r to z 
mean correlation procedure recommended fay Mcileraar (1966# p. 
139). The aean correlation between Model A and an 
independent criterion (i.e., the saved ratings) across all 
three cohorts (mean 1) was higher than that obtained by 
Model B, but not significantly higher (p>0.15). However, 
ample evidence had been found which required the exclusion 
of the 1978-79:3 data from this comparison- Therefore, Mean 
2 was calculated only upon the results for cohorts 1978-79:4 
and 1979-80:2. This resulted in r=0.62 for the proposed 
model, while the mean correlation between the criterion and 
Model B predictions was r-0.41. Each of these correlations 
was significantly greater than r=0.0 (p<0.004). Further, 
the proposed model predicted the criterion significantly 
better (z=2«62> p<0.004) than did the alternative model. 
This result directly validates the theoretical constructs of 
both rater leniency and subject achievement. 

Model A's predictions correlated higher with the 
independent criterion ratings, r=0.61, because Model A's 
predictions were more nearly valid. The raw ratings 
contained two components: subject achievement and rater 
leniency. As measures of true subject performance, the raw 
ratings were contaminated with rater leniency and were 
therefore less valid and reliable measures of true subject 
performance. The reliability of r=0.50 for raw ratings 
reported in earlier work (Cason and Cason, 1979) was an 
overestimate because it did not take the leniency effect 
into account. The best available estimate for the 
reliability of raw ratings as measures of performance alone 
was the mean correlation between Model B and the criterion 
ratings in the last two cohorts (mean 2): r=0.41. Our 
model attained higher correlations with the criterion 
because it explicitly used both rater leniency and 
achievement data to make its predictions. The model 
depicted the data more validly than could the mean of raw 
ratings in incomplete data sets. Therefore, the best 
available measure of student performance or student 
achievement was the rating that our model predicted a rater 
of average leniency would assign a given subject (or, its 
equivalent on the latent scale, this subject's SAP). 

Applying our model, the reliability of a single rating 
as a measure of true performance was r=0.61. Leniency 
effects had been removed; therefore, Spearman-Brown's 
formula was appropriate to conservatively estimate the 
reliability of a rating based upon several independent 
raters. Specifically, our model's predicted mean rating for 
each subject based on 5 ratings had an estimated reliability 
of r=0.89. By the same logic, the reliability of the mean 
of 5 raw ratings as a measure of true performance was 
calculated taking r=0.41 as the reliability of a single raw 
rating. Applying Spearman-Brown's formula, this gave r=0.78 
for the reliability of the mean of 5 observed ratings as a 
measure of student true performance. Because validity 
cannot exceed reliability, these results clearly indicated 
our model could produce substantially more nearly valid 
measures of student true performance from an incomplete data 
table than could the mean of observed ratings on each 
student. 
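
The Spearman-Brown step can be checked with a few lines of 
arithmetic. The sketch below is illustrative only (the function 
name is ours); it plugs the two single-rating reliabilities, 
0.61 and 0.41, into the standard formula for the mean of k 
parallel ratings. 

```python
def spearman_brown(r_single, k):
    """Reliability of the mean of k parallel ratings."""
    return k * r_single / (1 + (k - 1) * r_single)

model = spearman_brown(0.61, 5)   # single model-based rating
raw = spearman_brown(0.41, 5)     # single raw rating
print(round(model, 2), round(raw, 2))  # 0.89 0.78
```

With 5 ratings the model-based figure rounds to 0.89 and the 
raw-rating figure to 0.78. 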

Conclusions and Implications 

All the a priori objectives of the research were 
attained. With respect to clinical performance rating data 
sets of a type which are common to health professions 
education (i.e., dirty and incomplete), the proposed model 
was empirically demonstrated to have: (a) closely fit the 
data (p<0.000001); (b) clarified and quantified the separate 
contributions of rater leniency and subject achievement 
(e.g., 20% and 35% of variance accounted for, respectively, 
in these data; empirical cross-validation of both 
constructs; and so forth); and (c) provided a usable 
mechanism for generating more reliable and valid 
ratings-based measures of clinical performance, as indicated 
by the reliability of r=0.89 (based on 5 independent 
ratings) attained from application of the proposed model as 
compared to r=0.78 attained for the most commonly used 
current alternative, i.e., the mean of the 5 observed 
ratings. 

The results clearly demonstrated the superiority of the 
proposed model when data sets were incomplete and subjects 
were rated by unrepresentative subsets of raters. In 
addition, an empirical method for judging the adequacy of 
the data for the application of the model was demonstrated. 
When the proposed model failed to provide fit with the data 
at least as good as the mean of each subject's observed 
ratings, the data set was insufficient to provide adequate 
estimates of the proposed model's parameters. Nevertheless, 
the proposed model provided improved measures of performance 
when the data set was as little as 17% complete. 

The conditions of the tests contrasting the proposed 
model (A) with the mean of the observed ratings were biased 
against the proposed model. Assignment of students had been 
random, so variation of average leniency in rater subsets 
would tend to be small. This tended to reduce the rater 
main effect in these data. In settings where non-random 
assignment occurs, larger discrepancies in mean rater 
leniency could easily occur. In such settings, the power of 
the proposed model in producing more valid measures would be 
even more pronounced. Assuming the proposed model was 
valid, Table 7 provides a "worst case" example of the 
potential impact of rater leniency upon ratings received by 
students. This example was based on the extreme (lenient 
and stringent) raters and extreme (low and high achieving) 
students in cohort 1979-80:2. The top row depicts the 
ratings that the most stringent rater would assign; the 
bottom row the most lenient. The left column gives the 
corresponding rater reference points for the two raters. 
The middle column gives the expected rating for the low 
achieving student; the rightmost column, the expected 
ratings for the high achieving student. Both raters see 
the high achieving student much the same; there is only a 
10% difference in ratings. But the low achieving student 
is predicted to receive drastically different ratings: 
there is a 30% difference in ratings. Predictions rather 
than observed discrepancies were used in the illustration 
because it was the model that was validated in this 
research. Whether discrepancies as large as this occurred 
in these data was a chance matter. The model's predictions 
were a better general indicator of the possible magnitude 
than coincidental data because the model captured a set of 
relationships in whole data sets. 



                           Table 7 

        Maximum Effect of Rater Leniency on Predicted 
             Student Ratings in Cohort 79-80:2 


                                 Low Student    High Student 
                      RRP        (SAP 497.9)    (SAP 653.7) 


   Stringent rater   (534.9)       35.59%         88.27% 

   Lenient rater     (452.1)       67.65%         97.81% 
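
The entries of Table 7 can be approximately reproduced from the 
tabled RRP and SAP values. The sketch below is a hedged 
illustration, not the authors' program: it assumes the rater 
characteristic curve is the cumulative normal of (SAP - RRP) 
divided by a scale of 100 latent units. The scale of 100 is our 
inference from the table values, and the function and variable 
names are hypothetical. 

```python
import math

SCALE = 100.0  # assumed latent-scale unit, inferred from Table 7

def expected_rating(sap, rrp, scale=SCALE):
    """Expected rating (percent) under a normal-ogive curve."""
    z = (sap - rrp) / scale
    return 100.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

raters = {"Stringent": 534.9, "Lenient": 452.1}   # RRPs from Table 7
students = {"Low": 497.9, "High": 653.7}          # SAPs from Table 7

table = {(r, s): expected_rating(sap, rrp)
         for r, rrp in raters.items()
         for s, sap in students.items()}
for key, value in table.items():
    print(key, round(value, 1))
```

Under these assumptions each computed value agrees with the 
corresponding table entry to within about 0.1 percentage point, 
and the gap between the two raters is far wider for the low 
achieving student than for the high achieving one. 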



In spite of the consistency, strength, and coherence of 
the results supporting the proposed model found in these 
data, these data were limited. Only one setting, an 
internal medicine clerkship, was represented. Only one 
rating inventory was used. Still, 75 different raters were 
involved and 94 different students were rated. It would not 
be prudent to conclude that the proposed model will fit 
every conceivable performance rating setting. Neither would 
it be reasonable to ignore the strength of the results from 
these limited data. There are too many commonalities 
between these data and many others not to expect that this 
model may prove very useful in a wide variety of settings 
and contexts. 

Extrapolating optimistically from these early, 
promising results, a number of useful possibilities occur to 
us. Our model might meet Meskauskas and Norcini's (1980) 
requirements for a methodology for "handicapping" judges in 
both standards setting and performance assessment procedures 
better than do Stanley's (1961) methods. Our results 
suggest that in some settings rater leniency may not be 
sufficiently stable to use Stanley's methods. However, 
because our model can be applied to incomplete data sets, it 
provides a means of "adjusting" judges' ratings on the basis 
of their current behavior rather than on their past ratings. 

An intriguing possibility is the application of our 
model to the problem of assessing the test items in a large 
item bank. Some test item banks now have thousands of items 
in them. But these items are not equally relevant to the 
objectives of specific training programs which may use these 
test item banks. Our model would permit a more uniform 
standard to be applied in judging the difficulty or 
relevance of items in the item bank while reducing the 
extent to which redundant judgements were required. For 
example, our model might permit judges to consider only 
slightly overlapping subsets of items while applying 
Angoff's (1971) or a similar standards setting method. The 
judges' judgements could be calibrated through the common 
items that they judged. This would permit a small number of 
judges (e.g., the faculty in a department) to evaluate a 
larger item bank without either taking years or imposing an 
unrealistic burden on the individuals. 

Our model provides a technique whereby it would be 
possible to "track" the rating performance of individual 
raters and provide them with feedback on how their ratings 
compared with other raters in settings where not all raters 
rate all subjects. This might even be useful in settings 
where raters had been trained to a very high level of skill 
so that only few raters would rate each subject. So long as 
there were adequate overlapping ratings, the model would 
provide a way of monitoring raters that was non-intrusive 
and inexpensive since it requires only their routine rating 
data. 

There are at least two general ways in which our model 
may prove to be of research interest. First, the model 
itself, in so far as it is a simplification of a somewhat 
more elaborate theory, deserves investigation. Perhaps 
incorporation of differential rater sensitivity, explicit 
representation of problem or situation difficulty, or other 
elaborations of the proposed model would lead to further 
improvements in ratings-based measures of complex human 
performance. However, such elaborations would involve 
adding parameters and this would require more nearly 
complete data sets if useful estimates of the parameters 
were to be achievable. In spite of the success of the 
simplified model in fitting and explaining the relationships 
in these data, the model is a gross simplification of even 
the rudimentary performance rating theory that we have 
proposed. 

Second, the proposed model may be useful as an analytic 
method in research involving complex human performance as 
either a criterion or predictor variable. With notable 
exceptions such as Sheehan, Husted, Candee, Cook, and 
Bargen's (1980) report, prior investigations of the 
relationships between complex performance variables (such as 
clinical performance) and variables measured by more 
reliable methods (such as objectively scored aptitude and 
achievement tests) have found only very modest relationships 
or none at all. This may have arisen in part because of the 
relatively low reliability and/or validity of the available 
ratings-based measures of complex performance. The proposed 
model may have a substantial contribution to make to these 
investigations by providing a way to obtain more nearly 
valid and highly reliable measures of complex performance 
than have been available in the past. This prospect is 
especially exciting for those areas of performance where 
there are already large but dirty and incomplete data sets 
available and/or those areas which, for practical reasons, 
may be unable to concurrently produce both clean and 
complete data sets regardless of the resources available. 

While it is desirable that the judgements of individual 
judges be made as reliable and valid as possible, there 
will almost certainly always be more assessment programs 
that generate incomplete, dirty data sets than complete, 
clean ones. The model we have presented here shows real 
promise for improving the quality of the assessment 
information that may be extracted under these less than 
ideal and unfortunately common circumstances. 







References 



Anderson, D.Q., Baker, H.H., Laguna, J.E., and Laguna, J.F. 
Applying the Rasch model to improve health science clerkship 
evaluations. Presented at the Annual Meeting of the Rocky 
Mountain Educational Research Association, Las Cruces, N.M., 
1980. 

Angoff, W.H. Scales, norms and equivalent scores. In R.L. 
Thorndike (Ed.) Educational measurement (2nd ed.). Washington, 
D.C.: American Council on Education, 1971. 

Baker, F.B. Advances in item analysis. Review of Educational 
Research, 1977, 47, 151-178. 

Cason, G.J. MERLIN: A FORTRAN IV program for finding 
least-squares estimates of rater reference points, subject 
achievement points, and goodness-of-fit for Cason and Cason's 
model of performance rating. Copyright 1980 by Gerald J. Cason. 
(Available from author.) 

Cason, G.J., and Cason, C.L. Rating students' clinical 
performance: Interim report number 2. Presented at the Annual 
Meeting of the Mid-South Educational Research Association, Little 
Rock, Arkansas, 1979. 

Chandler, J.P. STEPIT: A FORTRAN II subroutine for finding 
local minima of real functions. Copyright by J.P. Chandler. 
(Available from Quantum Chemistry Program Exchange, Indiana 
University: Bloomington, Indiana.) 

Cromier, G. A study of the applicability of a truly objective 
model in medical education. In Proceedings of the Sixteenth 
Annual Conference on Research in Medical Education. Washington, 
D.C.: American Association of Medical Colleges, 1977, 123-128. 

Davidge, A.M., Davis, W.K., and Hull, A.L. A system for the 
evaluation of medical students' clinical competence. Journal of 
Medical Education, 1980, 55, 65-67. 

Davis, W.K., Hull, A.L., Davidge, A.M., and Dielman, T.E. 
Variables influencing ratings of medical student's clinical 
performance. Presented at the Annual Meeting of the American 
Educational Research Association, San Francisco, 1979. 

Dielman, T.E., Hull, A.L., and Davis, W.K. Psychometric 
properties of clinical performance rating. Evaluation and the 
Health Professions, 1980, 3(1), 103-117. 

Ebel, R.L. Estimation of the reliability of ratings. 
Psychometrika, 1951, 16, 407-424. 






Hambleton, R.K., Swaminathan, H., Cook, L.L., Eignor, D.R., and 
Gifford, J.A. Developments in latent trait theory: Models, 
technical issues, and applications. Review of Educational 
Research, 1978, 48, 467-510. 

Harasym, P.H. A comparison of the Nedelsky and modified Angoff 
standard-setting procedure on evaluation outcome. In Proceedings 
of the Nineteenth Annual Conference on Research in Medical 
Education. Washington, D.C.: American Association of Medical 
Colleges, 1980, 3-8. 

Hughes, F.P. The Rasch model applied to the equating of several 
examination forms. Paper presented at the Annual Meeting of the 
American Educational Research Association, San Francisco, 1979. 

Kreines, D.C., and Mead, R.J. Equating tests with the Rasch 
model. Paper presented at the Annual Meeting of the American 
Educational Research Association, San Francisco, 1979. 

Landy, F., and Barnes, J. Scaling behavioral anchors. Applied 
Psychological Measurement, 1978, 3(2), 193-200. 

Lord, F.M. A theory of test scores. Psychometric Monographs, 
1952, No. 7. 

Lord, F.M. An application of confidence intervals and maximum 
likelihood to the estimation of an examinee's ability. 
Psychometrika, 1953, 18, 57-75. 

McNemar, Q. Psychological statistics (3rd Ed.). New York: 
Wiley, 1966. 

Mead, R.J., Wright, B.D., and Bell, S.R. BICAL: Version 3. 
Computer program to perform Rasch item analysis. Chicago: 
University of Chicago, 1979. 

Meskauskas, J.A., and Norcini, J.J. Standard-setting in written 
and interactive (oral) specialty certification examinations: 
Issues, models, methods, challenges. Evaluation and the Health 
Professions, 1980, 3(3), 321-360. 

Nedelsky, L. Absolute grading standards for objective tests. 
Educational and Psychological Measurement, 1954, 14, 3-19. 

Nunnally, J.C. Psychometric theory. New York: McGraw-Hill, 
1967. 

O'Donohue, W.J., and Wergin, J.F. Evaluation of medical students 
during a clinical clerkship in internal medicine. Journal of 
Medical Education, 1978, 53, 55-58. 

Pierleoni, R.G., Clark, G.H., and Dudding, B.A. A comparison of 
faculty, resident, and nurse practitioner ratings of ambulatory 
pediatric students. Presented at the Annual Meeting of the 
American Educational Research Association, San Francisco, 1979. 






Printen, K.J., Chappell, and Whitney, D.R. Clinical 
performance evaluation of junior medical students. Journal of 
Medical Education, 1973, 48, 343-348. 

Rasch, G. An item analysis which takes individual differences 
into account. British Journal of Mathematical and Statistical 
Psychology, 1966, 19, 49-57. 

Remmers, H.H., Shock, N.W., and Kelly, E.L. An empirical study 
of the validity of the Spearman-Brown formula as applied to the 
Purdue Rating Scale. Journal of Educational Psychology, 1927, 
18, 187-195. 

Schumaker, C.F., et al. Applying the Rasch model to equate 
examinations in the field of medicine. Presented at the Annual 
Meeting of the American Educational Research Association, San 
Francisco, 1979. 

Sheehan, J.T., Husted, S.C.R., Candee, D., Cook, C.D., and 
Bargen. Moral judgement as a predictor of clinical 
performance. Evaluation and the Health Professions, 1980, 3(4), 
393-404. 

Smith, H.A., and Kifer, E. Student evaluation in an externship 
utilizing the Rasch model for test calibration. American Journal 
of Pharmaceutical Education, 1980, 44, 6-11. 

Smith, P.C., and Kendall, L. Retranslation of expectations: An 
approach to the construction of unambiguous anchors for rating 
scales. Journal of Applied Psychology, 1963, 47, 149-155. 

Snedecor, G.W. Statistical methods (4th Ed.). Ames, Iowa: 
Iowa State College Press, 1946. 

Stanley, J.C. Analysis of unreplicated three-way classifications 
with applications to rater bias and trait independence. 
Psychometrika, 1961, 25(2), 203-219. 

Sternberg, S. Stochastic learning theory. In R.D. Luce, R.R. 
Bush and E. Galanter (Eds.) Handbook of Mathematical Psychology, 
Volume II. New York: Wiley, 1967. 

Stillman, P.L. Arizona Clinical Interview Medical Rating Scale. 
Medical Teacher, 1980, 2(5), 248-251. 

Stillman, P.L., Brown, D.R., Redfield, D.L., and Sabers, D.L. 
Construct validation of the Arizona Clinical Interview Rating 
Scale. Educational and Psychological Measurement, 1977, 37, 
1031-1038. 

Symonds, P.M. Diagnosing personality and conduct. New York: 
Century, 1931. 

Ward, J., and Jennings, E. Introduction to linear models. 
Englewood Cliffs, N.J.: Prentice-Hall, 1973. 






Wright, B.D. Sample-free test calibration and person 
measurement. In Proceedings of the 1967 Invitational Conference 
on Testing Problems. Princeton, N.J.: Educational Testing 
Service, 1968. 

Wright, B.D., and Stone, M.H. Best test design. Chicago: MESA 
Press, 1979. 



Acknowledgements 

We gratefully acknowledge the encouragement, assistance, and 
co-operation of: George Ackerman, Harry Ackerman, Jerry 
Blackburn, Roger Bone, Tom Bruce, John Delk, Ross Dykman, Ron 
Hale, Lisa Hale, Peter Kohler, Tom Lewis, Tom Monson, Jim 
Phillips, Bob Shannon, Lois Tipton, and Ture Schoultz. 

This research was supported in part by the taxpayers of the 
United States through grant No. 90AL0005-01 from the Department 
of Health and Human Services. 



[Appendix: student performance rating form (scanned image; only 
the item headings are recoverable). Items: Understanding facts, 
rules; Problem-solving: analysis, synthesis, evaluation; 
Communication (with) and Attitude (toward) Peers (Jr. Med. 
Students), Patients, Faculty, Residents, and Clinical Team (RNs, 
Techs, etc.); Implicit responsibilities; Being corrected; Basic 
Patient "Work-up" (Conducting History, Conducting Physical Exam, 
Recording History, Reporting Physical Exam, Requesting 
Studies/Tests, Consult Results, Synthesizing Problem/Formulating 
Diagnosis); Therapeutic Design/Procedures (Selecting/formulating 
treatment; Manual Skills and Executing procedures; Follow-up, 
evaluation, revision of treatment regimen); Performance Under 
Stress; Potential for Advanced Training; Provisional Overall 
Grade (5 = A = Outstanding, 4 = B = Above Average, 3 = C = 
Average, 2 = D = Below Average, 1 = F = Unsatisfactory); Rater's 
Comments.] 